Preface


This textbook is aimed at graduate students and upper level undergraduates 
in mathematics, engineering, and computer science. The material and the ap- 
proach of the text were developed over several years at Auburn University in two 
independent courses, Information Theory and Data Compression. Although the 
material in the two courses is related, we think it unwise for information theory 
to be a prerequisite for data compression, and have written the data compression 
section of the text so that it can be read by or presented to students with no prior 
knowledge of information theory. There are references in the data compression 
part to results and proofs in the information theory part of the text, and those 
who are interested may browse over those references, but it is not absolutely 
necessary to do so. In fact, perhaps the best pedagogical order of approach to 
these subjects is the reverse of the apparent logical order: students will come 
to information theory curious and better prepared for having seen some of the 
definitions and theorems of that subject playing a role in data compression. 

Our main aim in the data compression part of the text, as well as in the 
course it grew from, is to acquaint the students with a number of significant 
lossless compression techniques, and to discuss two lossy compression meth- 
ods. Our aim is for the students to emerge competent in and broadly conversant 
with a large range of techniques. We have striven for a “practical” style of 
presentation: here is what you do and here is what it is good for. Nonethe- 
less, proofs are provided, sometimes in the text, sometimes in the exercises, so 
that the instructor can have the option of emphasizing the mathematics of data 
compression to some degree. 

Information theory is of a more theoretical nature than data compression. 
It provides a vocabulary and a certain abstraction that can bring the power of 
simplification to many different situations. We thought it reasonable to treat it 
as a mathematical theory and to present the fundamental definitions and ele- 
mentary results of that theory in utter abstraction from the particular problems 
of communication through noisy channels, which inspired the theory in the first 
place. We bring the theory to bear on noisy channels in Chapters 3 and 4. 

The treatment of information theory given here is extremely elementary. 
The channels are memoryless and discrete, and the sources are all “zeroth- 
order,” one-state sources (although more complicated source models are dis- 
cussed in Chapter 7). We feel that this elementary approach is appropriate for 
the target audience, and that, by leaving more complicated sources and channels 
out of the picture, we more effectively impart the grasp of Information Theory 
that we hope our students will take with them. 

The exercises range from the routine to somewhat lengthy problems that 
introduce additional material or establish more difficult results. An asterisk by 
an exercise or section indicates that the material is off the main road, so to speak,
and might reasonably be skipped. In the case of exercises, it may also indicate 
that the problem is hard and/or unusual. 

In the data compression portion of the book, a number of projects require 
the use of a computer. Appendix A documents Octave and Matlab scripts writ- 
ten by the authors that can be used on some of the exercises and projects involv- 
ing transform methods and images, and that can also serve as building blocks 
for other explorations. The software can be obtained from the authors’ site, 
listed in Appendix C. In addition, the site contains information about the book, 
an online version of Appendix A, and links to other sites of interest. 


Organization 


Here’s a brief synopsis of each chapter and appendix. 


Chapter 1 contains an introduction to the language and results of probability 
theory. 


Chapter 2 presents the elementary definitions of information theory, a justifi- 
cation of the quantification of information on which the theory is based, 
and the fundamental relations among various sorts of information and 
entropy. 


Chapter 3 is about information flow through discrete memoryless noisy chan- 
nels. 


Chapter 4 is about coding text from a discrete source, transmitting the en- 
coded text through a discrete memoryless noisy channel, and decoding 
the output. The “classical” fundamental theorems of information theory, 
including the Noisy Channel Theorem, appear in this chapter. 


Chapter 5 begins the material of the data compression portion of this book. 
Replacement schemes are discussed and the chapter concludes with the 
Noiseless Coding Theorem, proved here for a binary code alphabet. (It 
appears in Chapter 4 in more general form.) 


Chapter 6 discusses arithmetic coding, which is of considerable interest since 
it is optimal in a certain way that the replacement schemes are not. Con- 
siderations for both an “ideal” scheme and for practical implementation 
on a computer are presented. 


Chapter 7 focuses on the modeling aspects of Chapters 5 and 6 (Chapter 8 
continues the discussion). Since coding methods such as those presented 
in Chapter 6 can (in theory) produce optimal-length output for a given 
model of the source, much of the interest in improving compression in 
statistical schemes lies in improving the model of the source. Higher-order
models attempt to use larger contexts for predictions. In the second
edition, a section on probabilistic finite state source automata has been
added. 


Chapter 8 considers another approach to modeling, using statistics that are 
updated as the source is read and encoded. These have the advantage that 
no statistical study needs to be done in advance and the scheme can also 
detect changes in the nature of the source. 


Chapter 9 discusses popular dictionary methods. These have been widely 
used, in part due to their simplicity, speed, and relatively good compres- 
sion. Applications such as Ross Williams’ LZRW1 algorithm, Unix com- 
press, and GNU zip (gzip) are examined. 


Chapter 10 develops the Fourier, cosine, and wavelet transforms, and dis- 
cusses their use in compression of signals or images. The lossy scheme 
in JPEG is presented as a widely-used standard that relies on transform 
techniques. The chapter concludes with an introduction to wavelet-based 
compression. 


Appendix A documents the use of the “JPEGtool” collection of Octave and 
Matlab scripts in understanding JPEG-like image compression. 


Appendix B contains the source listing for Ross Williams’ LZRW1-A algo- 
rithm, which rather concisely illustrates a viable dictionary compression 
method. 


Appendix C contains material that didn’t fit elsewhere. The first section lists 
sources for information and code for many areas of data compression. 
The second section contains a few notes on patents affecting the field. 
The final section contains a semi-famous story illustrating some of the 
misunderstandings about compression. 


Appendix D offers solutions and notes on the exercises. 
Chapter 1


Elementary Probability 


1.1 Introduction 


Definition A finite probability space is a pair (S, P), in which S is a finite
non-empty set and $P : S \to [0, 1]$ is a function satisfying $\sum_{s \in S} P(s) = 1$.
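
A minimal Python sketch may help fix the definition. The helper name `is_probability_space`, the dictionary representation, and the three-outcome example are illustrative assumptions of ours, not notation from the text. Exact fractions are used so that the sum can be compared with 1 without rounding error.

```python
from fractions import Fraction

def is_probability_space(P):
    """Return True if the dict P (outcome -> probability) defines a finite
    probability space. Hypothetical helper: S is taken to be the keys of P."""
    if not P:                                   # S must be non-empty
        return False
    if any(p < 0 or p > 1 for p in P.values()):
        return False                            # P must map S into [0, 1]
    return sum(P.values()) == 1                 # the probabilities must sum to 1

# A made-up three-outcome space
space = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
# Not a probability assignment: the values sum to 3/4
bad = {"a": Fraction(1, 2), "b": Fraction(1, 4)}

print(is_probability_space(space))  # True
print(is_probability_space(bad))    # False
```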


When the space (S, P) is fixed in the discussion, we will call S the set of 
outcomes of the space (or of the experiment with which the space is associated — 
see the discussion below), and P the probability assignment to the (set of) out- 
comes. 

The “real” situations we are concerned with consist of an action, or experi- 
ment, with a finite number of mutually exclusive possible outcomes. For a given 
action, there may be many different ways of listing the possible outcomes, but 
all acceptable lists of possible outcomes satisfy this test: whatever happens, it 
shall be the case that one and only one of the listed outcomes will have occurred. 

For instance, suppose that the experiment consists of someone jumping out 
of a plane somewhere over Ohio (parachute optional). Assuming that there is 
some way of defining the “patch upon which the jumper lands,” it is possible to 
view this experiment as having infinitely many possible outcomes, correspond- 
ing to the infinitely many patches that might be landed upon. But we can collect 
these infinitely many possibilities into a finite number of different categories 
which are, one would think, much more interesting and useful to the jumper 
and everyone else concerned than are the undifferentiated infinity of fundamen- 
tal possible outcomes. For instance, our finite list of possibilities might look 
like: (1) the jumper lands on some power line(s); (2) the jumper lands in a tree; 
...; (n) none of the above (in case we overlooked a possibility).

Clearly there are infinitely many ways to make a finite list of outcomes of 
this experiment. How would you, in practice, choose a list? That depends on 
your concerns. If the jumper is a parachutist, items like “lands in water” should 
probably be on the list. If the jumper is a suicidal terrorist carrying an atom 
bomb, items like “lands within 15 miles of the center of Cincinnati” might well 
be on the list. There is some art in the science of parsing the outcomes to suit 
your interest. Never forget the constraint that one and only one of the listed 
outcomes will occur, whatever happens. For instance, it is unacceptable to have 
“lands in water” and “lands within 15 miles of the center of Cincinnati” on the 


© 2003 by CRC Press LLC 




same list, since it is possible for the jumper to land in water within 15 miles of 
the center of Cincinnati. (See Exercise 1 at the end of this section.)

Now consider the definition at the beginning of this section. The set S is in- 
terpretable as the finite list of outcomes, or of outcome categories, of whatever 
experiment we have at hand. The function P is, as the term probability as- 
signment suggests, an assessment or measure of the likelihoods of the different 
outcomes. 

There is nothing in the definition that tells you how to provide yourself with 
S and P, given some actual experiment. The jumper-from-the-airplane example 
is one of a multitude that show that there may be cause for debate and occasion 
for subtlety even in the task of listing the possible outcomes of the given ex- 
periment. And once the outcomes are listed to your satisfaction, how do you 
arrive at a satisfactory assessment of the likelihoods of occurrence of the various
outcomes? That is a long story, only the beginnings of which will be told
in this chapter. There are plenty of people—actuaries, pollsters, quality-control 
engineers, market analysts, and epidemiologists—who make their living partly 
by their sophistication in the matter of assigning probabilities; assigning proba- 
bilities in different situations is the problem at the center of applied statistics. 

To the novice, it may be heartening to note that two great minds once got 
into a confused dispute over the analysis of a very simple experiment. Before 
the description of the experiment and the dispute, we interject a couple of com- 
ments that will be referred to throughout this chapter. 


Experiments with outcomes of equal likelihood 


If it is judged that the different outcomes are equally likely, then the condition
$\sum_{s \in S} P(s) = 1$ forces the probability assignment $P(s) = 1/|S|$ for all $s \in S$,
where $|S|$ stands for the size of S, also known as the number of elements in S,
also known as the cardinality of S.

Coins. In this chapter, each coin shall have two sides, designated “heads” 
and “tails,” or H and T, for short. On each flip, toss, or throw of a coin, one 
of these two “comes up.” Sometimes H will be an abbreviation of the phrase 
“heads comes up,” and T similarly. Thus, in the experiment of tossing a coin
once, the only reasonable set of outcomes is abbreviable {H, T}. 

A fair coin is one for which the outcomes H and T of the one-toss experi- 
ment are equally likely—i.e., each has probability 1/2. 


The D’Alembert-Laplace controversy 


D’Alembert and Laplace were great mathematicians of the 18th and 19th centuries.
Here is the experiment about the analysis of which they disagreed: a fair
coin is tossed twice. 

The assumption that the coin is fair tells us all about the experiment of 
tossing it once. Tossing it twice is the next-simplest experiment we can perform 
with this coin. How can controversy arise? Consider the question: what is the 
probability that heads will come up on each toss? D’Alembert’s answer: 1/3.
Laplace’s answer: 1/4. 

D’Alembert and Laplace differed right off in their choices of sets of outcomes.
D’Alembert took $S_D$ = {both heads, both tails, one head and one tail},
and Laplace favored $S_L$ = {HH, HT, TH, TT} where, for instance, HT stands
for “heads on the first toss, tails on the second.” Both D’Alembert and Laplace
asserted that the outcomes in their respective sets of outcomes are equally likely, 
from which assertions you can see how they got their answers. Neither provided 
a convincing justification of his assertion. 
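
Both answers follow mechanically from the chosen outcome set once equal likelihood is assumed. The sketch below only carries out that arithmetic; the helper name `uniform` is our own, and the code does not settle which equal-likelihood assumption is justified.

```python
from fractions import Fraction

def uniform(outcomes):
    """Assign probability 1/|S| to every outcome (equal-likelihood assumption)."""
    outcomes = list(outcomes)
    return {s: Fraction(1, len(outcomes)) for s in outcomes}

# D'Alembert's outcome set and Laplace's outcome set
P_dalembert = uniform(["both heads", "both tails", "one head and one tail"])
P_laplace = uniform(["HH", "HT", "TH", "TT"])

print(P_dalembert["both heads"])  # 1/3
print(P_laplace["HH"])            # 1/4
```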

We will give a plausible justification of one of the two assertions above in 
Section 1.3. Whether or not the disagreement between D’Alembert and Laplace
is settled by that justification will be left to your judgment. 


Exercises 1.1 


1. Someone jumps out of a plane over Ohio. You are concerned with whether 
or not the jumper lands in water, and whether or not the jumper lands within 
15 miles of the center of Cincinnati, and with nothing else. [Perhaps the 
jumper carries a bomb that will not go off if the jumper lands in water, and 
you have relatives in Cincinnati. ] 


Give an acceptable list of possible outcomes, as short as possible, that will 
permit discussion of your concerns. [Hint: the shortest possible list has 
length 4.] 


2. Notice that we can get D’Alembert’s set of outcomes from Laplace’s by 
“amalgamating” a couple of Laplace’s outcomes into a single outcome. 


More generally, given any set S of outcomes, you can make a new set of
outcomes $\tilde{S}$ by partitioning S into non-empty sets $P_1, \dots, P_m$ and setting
$\tilde{S} = \{P_1, \dots, P_m\}$. [To say that subsets $P_1, \dots, P_m$ of S partition S is to
say that $P_1, \dots, P_m$ are pairwise disjoint, i.e., $\emptyset = P_i \cap P_j$, $1 \le i < j \le m$,
and cover S, i.e., $S = \bigcup_{i=1}^{m} P_i$. Thus, “partitioning” is “dividing up into
non-overlapping parts.”]


How many different sets of outcomes can be made, in this way, from a set of
outcomes with four elements? 


1.2 Events 


Throughout, (S, P) will be some finite probability space. An event in this space 
is a subset of S. If $E \subseteq S$ is an event, the probability of E, denoted P(E), is

$$P(E) = \sum_{s \in E} P(s).$$

Some elementary observations: 
(i) if $s \in S$, then $P(\{s\}) = P(s)$;
(ii) $P(\emptyset) = 0$;
(iii) $P(S) = 1$;
(iv) if the outcomes in S are equally likely, then, for each $E \subseteq S$, $P(E) = |E|/|S|$.

Events are usually described in plain English, by a sentence or phrase in- 
dicating what happens when that event occurs. The set indicated by such a 
description consists of those outcomes that satisfy, or conform to, the descrip- 
tion. For instance, suppose an urn contains red, green, and yellow balls; suppose 
that two are drawn, without replacement. We take S = {rr, rg, ry, gr, gg, gy,
yr, yg, yy}, in which, for instance, rg is short for “a red ball was drawn on the
first draw, and a green on the second.” Let E = “no red ball was drawn.” Then
E = {gg, gy, yg, yy}.

Notice that the verbal description of an event need not refer to the set S 
of outcomes, and thus may be represented differently as a set of outcomes for 
different choices of S. For instance, in the experiment of tossing a coin twice, 
the event “both heads and tails came up” is a set consisting of a single out- 
come in D’Alembert’s way of looking at things, but consists of two outcomes 
according to Laplace. (Notice that the event “tails came up on first toss, heads 
on the second” is not an admissible event in D’Alembert’s space; this does not
mean that D’Alembert was wrong, only that his analysis is insufficiently fine 
to permit discussion of certain events associable with the experiment. Perhaps 
he would argue that distinguishing between the tosses, labeling one “the first” 
and the other “the second,” makes a different experiment from the one he was 
concerned with.) 
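
The point that one verbal event may be realized as different subsets of different outcome sets can be made concrete. In the sketch below, the helper `prob` and the label "mixed" are our own names, and the equal-likelihood assignments on both spaces are assumptions made purely for illustration.

```python
from fractions import Fraction

def prob(P, E):
    """P(E): the sum of P(s) over the outcomes s in the event E."""
    return sum((P[s] for s in E), Fraction(0))

# Assumed equal-likelihood assignments on the two outcome sets
laplace = {s: Fraction(1, 4) for s in ("HH", "HT", "TH", "TT")}
dalembert = {s: Fraction(1, 3) for s in ("both heads", "both tails", "mixed")}

# The same verbal event, "both heads and tails came up", as a subset of each S
mixed_laplace = {"HT", "TH"}      # two outcomes
mixed_dalembert = {"mixed"}       # a single outcome

print(prob(laplace, mixed_laplace))      # 1/2
print(prob(dalembert, mixed_dalembert))  # 1/3
print(prob(laplace, set()))              # 0, observation (ii)
print(prob(laplace, laplace.keys()))     # 1, observation (iii)
```

The two assignments give different probabilities to the same verbal event, so at most one of them can be “correct” in the sense discussed next.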

Skeptics can, and should, be alarmed by the “definition” above of the prob- 
ability P(E) of an event E. If an event E has a description that makes no 
reference to the set S of outcomes, then E should have a probability that does 
not vary as you consider different realizations of E as a subset of different outcome
sets S. Yet the probability of E is “defined” to be $\sum_{s \in E} P(s)$, which
clearly involves S and P. This “definition” hides an assertion that deserves
our scrutiny. The assertion is that, however an event E is realized as a subset
of a set S of outcomes, if the probability assignment P to S is “correct,” then
the number $\sum_{s \in E} P(s)$ will be “correct,” the “correct” probability of E by any
“correct” assessment. 

Here is an argument that seems to justify the equation $P(E) = \sum_{s \in E} P(s)$
in all cases where we agree that P is a correct assignment of probabilities to the
elements of S, and that there is a correct a priori probability P(E) of the event
E, realizable as a subset of S. Let $\tilde{S} = (S \setminus E) \cup \{E\}$. That is, we are forming a
new set of outcomes by amalgamating the outcomes in E into a single outcome,
which we will denote by E. What shall the correct probability assignment $\tilde{P}$
to $\tilde{S}$ be? $\tilde{P}(E)$ ought to be the sought-after P(E), the correct probability of E.
Meanwhile, the outcomes of $S \setminus E$ are indifferent to our changed view of the
experiment; we should have $\tilde{P}(s) = P(s)$ for $s \in S \setminus E$. Then

$$1 = \sum_{s \in \tilde{S}} \tilde{P}(s) = \tilde{P}(E) + \sum_{s \in S \setminus E} P(s) = P(E) + \Bigl(1 - \sum_{s \in E} P(s)\Bigr),$$

which implies the desired equation.


Definition Events $E_1$ and $E_2$ in a probability space (S, P) are mutually exclusive
in case $P(E_1 \cap E_2) = 0$.


In common parlance, to say that two events are mutually exclusive is to say 
that they cannot both happen. Thus, it might seem reasonable to define $E_1$ and
$E_2$ to be mutually exclusive if and only if $E_1 \cap E_2 = \emptyset$, a stronger condition
than $P(E_1 \cap E_2) = 0$. It will be convenient to allow outcomes of experiments
that have zero probability just because ruling out such outcomes may require 
a lengthy verbal digression or may spoil the symmetry of some array. In the 
service of this convenience, we define mutual exclusivity as above. 


Example Suppose an urn contains a number of red and green balls, and exactly 
one yellow ball. Suppose that two balls are drawn, without replacement. If, as 
above, we take S = {rr, rg, ry, gr, gg, gy, yr, yg, yy}, then the outcome yy is
impossible. However we assign probabilities to S, the only reasonable probability
assignment to yy is zero. Thus, if $E_1$ = “a yellow ball was chosen on the
first draw” and $E_2$ = “a yellow ball was chosen on the second draw,” then $E_1$
and $E_2$ are mutually exclusive, even though $E_1 \cap E_2 = \{yy\} \neq \emptyset$.

Why not simply omit the impossible outcome yy from S? We may, for 
some reason, be performing this experiment on different occasions with differ- 
ent urns, and, for most of these, yy may be a possible outcome. It is a great 
convenience to be able to refer to the same set S of outcomes in discussing 
these different experiments. 
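
The distinction between an empty intersection and an intersection of probability zero can be checked on a small case. The sketch assumes a hypothetical urn with 3 red balls, 2 green balls, and 1 yellow ball, and assumes that every ordered pair of distinct balls is equally likely; neither assumption comes from the text.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical urn: 3 red, 2 green, exactly 1 yellow
balls = ["r"] * 3 + ["g"] * 2 + ["y"]

S = [a + b for a in "rgy" for b in "rgy"]     # rr, rg, ..., yy
P = {s: Fraction(0) for s in S}

# Assumption: every ordered pair of distinct balls is equally likely
draws = list(permutations(range(len(balls)), 2))
for i, j in draws:
    P[balls[i] + balls[j]] += Fraction(1, len(draws))

E1 = {s for s in S if s[0] == "y"}   # yellow on the first draw
E2 = {s for s in S if s[1] == "y"}   # yellow on the second draw

both = E1 & E2
print(both)                          # {'yy'}: the intersection is not empty
print(sum(P[s] for s in both))       # 0: yet E1 and E2 are mutually exclusive
```

Keeping yy in S with probability zero lets the same outcome set serve for urns that contain two or more yellow balls.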


Some useful observations and results As heretofore, (S, P) will be a probability
space, and E, F, $E_1$, $E_2$, etc., will stand for events in this space.


1.2.1 If $E_1 \subseteq E_2$, then $P(E_1) \le P(E_2)$.


1.2.2 If $E_1$ and $E_2$ are mutually exclusive, and $F_1 \subseteq E_1$ and $F_2 \subseteq E_2$, then $F_1$
and $F_2$ are mutually exclusive.


1.2.3 If $E_1, \dots, E_m$ are pairwise mutually exclusive (meaning $E_i$ and $E_j$ are
mutually exclusive when $1 \le i < j \le m$), then

$$P\Bigl(\bigcup_{i=1}^{m} E_i\Bigr) = \sum_{i=1}^{m} P(E_i).$$

For a “clean” proof, go by induction on m. For an instructive proof, first consider
the case in which $E_1, \dots, E_m$ are pairwise disjoint.


1.2.4 If $E \subseteq F \subseteq S$, then $P(F \setminus E) = P(F) - P(E)$.
Proof: Apply 1.2.3 with m = 2, $E_1 = E$ and $E_2 = F \setminus E$. □
1.2.5 $P(E \cup F) + P(E \cap F) = P(E) + P(F)$.


Proof: Observe that $E \cup F = (E \cap F) \cup (E \setminus (E \cap F)) \cup (F \setminus (E \cap F))$, a union
of pairwise disjoint events. Apply 1.2.3 and 1.2.4. □


1.2.6 $P(E) + P(S \setminus E) = 1$.
Proof: This is a corollary of 1.2.4. □


[When propositions are stated without proof, or when the proof is merely 
sketched, as in 1.2.4, it is hoped that the student will supply the details. The or- 
der in which the propositions are stated is intended to facilitate the verification. 
For instance, proposition 1.2.2 follows smoothly from 1.2.1.] 


Example Suppose that, in a certain population, 40% of the people have red 
hair, 25% have tuberculosis, and 15% have both. What percentage has neither?

The experiment that can be associated to this question is: choose a person 
“at random” from the population. The set S of outcomes can be identified with 
the population; the outcomes are equally likely if the selection process is indeed 
“random.” 

Let R stand for the event “the person selected has red hair,” and T for the
event “the person selected has tuberculosis.” As subsets of the set of outcomes,
R and T are the sets of people in the population who have red hair and tuberculosis,
respectively. We are given that P(R) = 40/100, P(T) = 25/100, and
$P(R \cap T) = 15/100$. Then

$$\begin{aligned}
P(S \setminus (R \cup T)) &= 1 - P(R \cup T) && \text{[by 1.2.6]} \\
&= 1 - [P(R) + P(T) - P(R \cap T)] && \text{[by 1.2.5]} \\
&= 1 - \left[\frac{40}{100} + \frac{25}{100} - \frac{15}{100}\right] = \frac{50}{100}.
\end{aligned}$$


Answer: 50% have neither. 
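
The same answer can be checked by counting. The sketch below assumes a hypothetical population of exactly 100 people, numbered 0 to 99, arranged to match the stated percentages; the numbering is ours. It compares the formula from 1.2.5 and 1.2.6 with a direct count.

```python
from fractions import Fraction

# Hypothetical population of 100 people matching the stated percentages
people = set(range(100))
R = set(range(0, 40))     # 40 people with red hair
T = set(range(25, 50))    # 25 people with tuberculosis; persons 25..39 have both

def prob(E):
    """Equally likely outcomes, so P(E) = |E| / |S| (observation (iv))."""
    return Fraction(len(E), len(people))

by_formula = 1 - (prob(R) + prob(T) - prob(R & T))   # 1.2.6, then 1.2.5
by_counting = prob(people - (R | T))                 # count those in neither set

print(by_formula, by_counting)  # 1/2 1/2
```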


Exercises 1.2 
1. In a certain population, 25% of the people are small and dirty, 35% are 
large and clean, and 60% are small. What percentage are dirty? 


2. In the experiment of tossing a fair coin twice, let us consider Laplace’s 
set of outcomes, {HH,HT,TH,TT}. We do not know how to assign 
probabilities to these outcomes as yet, but surely HT and TH ought to have
the same probability, and the events “heads on the first toss” and “heads on 
the second toss” each ought to have probability 1/2. 


Do these considerations determine a probability assignment? 


3. In a certain population, 30% of the people have acne, 10% have bubonic 
plague, and 12% have cholera. In addition, 




8% have acne and bubonic plague, 
7% have acne and cholera, 

4% have bubonic plague and cholera, 
and 2% have all three diseases. 


What percentage of the population has none of the three diseases? 




1.3 Conditional probability 


Definition Suppose that (S, P) is a finite probability space, $E_1, E_2 \subseteq S$, and
$P(E_2) \neq 0$. The conditional probability of $E_1$, given $E_2$, is
$P(E_1 \mid E_2) = P(E_1 \cap E_2)/P(E_2)$.


Interpretation. You may as well imagine that you were not present when 
the experiment or action took place, and you received an incomplete report on 
what happened. You learn that event E2 occurred (meaning the outcome was 
one of those in E2), and nothing else. How shall you adjust your estimate of 
the probabilities of the various outcomes and events, in light of what you now 
know? The definition above proposes such an adjustment. Why this? Is it valid? 

Justification. Supposing that $E_2$ has occurred, let’s make a new probability
space, $(E_2, \tilde{P})$, taking $E_2$ to be the new set of outcomes. What about the new
probability assignment, $\tilde{P}$? We assume that the new probabilities $\tilde{P}(s)$, $s \in E_2$,
are proportional to the old probabilities P(s); that is, for some number r, we
have $\tilde{P}(s) = rP(s)$ for all $s \in E_2$.

This might seem a reasonable assumption in many specific instances, but is 
it universally valid? Might not the knowledge that $E_2$ has occurred change our
assessment of the relative likelihoods of the outcomes in $E_2$? If some outcome
in $E_2$ was judged to be twice as likely as some other outcome in $E_2$ before
the experiment was performed, must it continue to be judged twice as likely as 
the other after we learn that E2 has occurred? We see no way to convincingly 
demonstrate the validity of this “proportionality” assumption, nor do we have in 
mind an example in which the assumption is clearly violated. We shall accept 
this assumption with qualms, and forge on. 

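Since $\tilde{P}$ is to be a probability assignment, we have

$$1 = \sum_{s \in E_2} \tilde{P}(s) = r \sum_{s \in E_2} P(s) = r P(E_2),$$

so $r = 1/P(E_2)$. Therefore, the probability that $E_1$ has occurred, given that $E_2$
has occurred, ought to be given by

$$P(E_1 \mid E_2) = \tilde{P}(E_1 \cap E_2) = \sum_{s \in E_1 \cap E_2} \tilde{P}(s) = r \sum_{s \in E_1 \cap E_2} P(s) = \frac{P(E_1 \cap E_2)}{P(E_2)}.$$
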


End of justification. 
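
The rescaling in the justification can be sketched in Python. The helper names `prob`, `conditional`, and `restrict`, and the four-outcome space, are hypothetical choices made for illustration. The code takes the proportionality assumption for granted, just as the argument above does.

```python
from fractions import Fraction

def prob(P, E):
    """P(E): the sum of P(s) over the outcomes s in E."""
    return sum((P[s] for s in E), Fraction(0))

def conditional(P, E1, E2):
    """P(E1 | E2) = P(E1 ∩ E2) / P(E2); defined only when P(E2) != 0."""
    p2 = prob(P, E2)
    if p2 == 0:
        raise ValueError("P(E2) must be non-zero")
    return prob(P, set(E1) & set(E2)) / p2

def restrict(P, E2):
    """The new space on E2: old probabilities rescaled by r = 1 / P(E2)."""
    r = 1 / prob(P, E2)
    return {s: r * P[s] for s in E2}

# Hypothetical four-outcome space
P = {"a": Fraction(1, 2), "b": Fraction(1, 4),
     "c": Fraction(1, 8), "d": Fraction(1, 8)}
E1, E2 = {"a", "c"}, {"b", "c", "d"}

P_new = restrict(P, E2)
print(sum(P_new.values()))         # 1: the rescaled values form a probability assignment
print(conditional(P, E1, E2))      # 1/4
print(prob(P_new, E1 & E2))        # 1/4, the same number computed in the new space
```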




Application to multi-stage experiments 


Suppose that we have in mind an experiment with two stages, or sub-experiments.
(Examples: tossing a coin twice, drawing two balls from an urn.) Let
$x_1, \dots, x_n$ denote the possible outcomes at the first stage, and $y_1, \dots, y_m$ at the
second. Suppose that the probabilities of the first-stage outcomes are known:
say

$$P(\text{“}x_i\text{ occurs at the first stage”}) = p_i, \quad i = 1, \dots, n.$$

Suppose that the probabilities of the $y_j$ occurring are known, whenever it is
known what happened at the first stage. Let us say that

$$P(\text{“}y_j\text{ occurs at the 2nd stage, supposing }x_i\text{ occurred at the first”}) = q_{ij}.$$


Now we will consider the full experiment. We take as the set of outcomes 
the set of ordered pairs 


$$S = \{x_1, \dots, x_n\} \times \{y_1, \dots, y_m\} = \{(x_i, y_j) : 1 \le i \le n,\ 1 \le j \le m\},$$

in which $(x_i, y_j)$ is short for the statement “$x_i$ occurred at the first stage, and $y_j$
at the second.” What shall the probability assignment P be?

Let $E_i$ = “$x_i$ occurred at the first stage” = $\{(x_i, y_1), \dots, (x_i, y_m)\}$, for each
$i \in \{1, \dots, n\}$, and let $F_j$ = “$y_j$ occurred at the second stage” = $\{(x_1, y_j), \dots,
(x_n, y_j)\}$, for each $j \in \{1, \dots, m\}$. Even though we do not yet know what P is,
we supposedly know something about the probabilities of these events. What
we know is that $P(E_i) = p_i$, and $P(F_j \mid E_i) = q_{ij}$, for each i and j. Therefore,

$$q_{ij} = P(F_j \mid E_i) = \frac{P(E_i \cap F_j)}{P(E_i)} = \frac{P(\{(x_i, y_j)\})}{p_i},$$

which implies

$$P((x_i, y_j)) = P(\{(x_i, y_j)\}) = p_i q_{ij}.$$


We now know how to assign probabilities to outcomes of two-stage experiments,
given certain information about the experiment that might, plausibly, be
obtainable a priori, from the description of the experiment. To put it simply, 
you multiply. (But what do you multiply?) Something similar applies to ex- 
periments of three or more stages. It is left to you to formulate what you do to 
assign probabilities to the outcomes of such experiments. 
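
The multiplication rule for two stages can be sketched as follows. The function name `two_stage` and the small urn (2 red balls, 1 green ball, two draws without replacement) are hypothetical; the dictionaries `p` and `q` play the roles of the first-stage probabilities and the second-stage conditional probabilities.

```python
from fractions import Fraction

def two_stage(p, q):
    """P((x, y)) = p[x] * q[x][y] for every first-stage x and second-stage y."""
    return {(x, y): p[x] * q[x][y] for x in p for y in q[x]}

# Hypothetical urn: 2 red balls, 1 green ball; two draws without replacement
p = {"r": Fraction(2, 3), "g": Fraction(1, 3)}
q = {
    "r": {"r": Fraction(1, 2), "g": Fraction(1, 2)},  # after a red: 1 red, 1 green left
    "g": {"r": Fraction(1), "g": Fraction(0)},        # after the green: 2 reds left
}

P = two_stage(p, q)
for outcome, value in P.items():
    print(outcome, value)

# Sanity checks: P sums to 1, and P("red at the first stage") recovers p["r"]
print(sum(P.values()))                        # 1
print(sum(P[("r", y)] for y in q["r"]))       # 2/3
```

Here each of rr, rg, and gr receives probability 1/3, and the impossible outcome gg receives probability 0.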


Examples 


1.3.1 An urn contains 5 red, 12 green, and 8 yellow balls. Three are drawn 
without replacement. 


(a) What is the probability that a red, a green, and a yellow ball will be 
drawn? 


(b) What is the probability that the last ball to be drawn will be green? 
Solution and discussion. If we identify the set of outcomes with the set of
all sequences of length three of the letters r, g, and y, in the obvious way (e.g.,
ryy stands for a red ball was drawn first, then two yellows), then we will have
27 outcomes; no need to list them all. The event described in part (a) is

$$E = \{rgy, ryg, gry, gyr, yrg, ygr\}.$$

These are assigned probabilities, thus:

$$P(rgy) = \frac{5}{25} \cdot \frac{12}{24} \cdot \frac{8}{23}, \qquad P(ryg) = \frac{5}{25} \cdot \frac{8}{24} \cdot \frac{12}{23},$$

etc. How are these obtained? Well, for instance, to see that $P(rgy) = \frac{5}{25} \cdot \frac{12}{24} \cdot \frac{8}{23}$,
you reason thus: on the first draw, we have 25 balls, all equally likely to be
drawn (it is presumed from the description of the experiment), of which 5 are
red, hence a probability of $\frac{5}{25}$ of a red on the first draw; having drawn that red,
there is then a $\frac{12}{24}$ probability of a green on the second draw, and having drawn
first a red, then a green, there is probability $\frac{8}{23}$ of a yellow on the third draw.
Multiply.

In this case, we observe that all six outcomes in E have the same probability
assignment. Thus the answer to (a) is $6 \cdot \frac{5}{25} \cdot \frac{12}{24} \cdot \frac{8}{23} = \frac{24}{115}$.

For (b), we cleverly take a different set of outcomes, and ponder the event 


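$$F = \{NNg, Ngg, gNg, ggg\},$$

where N stands for “not green.” We have

$$\begin{aligned}
P(F) &= \frac{13}{25} \cdot \frac{12}{24} \cdot \frac{12}{23} + \frac{13}{25} \cdot \frac{12}{24} \cdot \frac{11}{23} + \frac{12}{25} \cdot \frac{13}{24} \cdot \frac{11}{23} + \frac{12}{25} \cdot \frac{11}{24} \cdot \frac{10}{23} \\
&= \frac{1}{25} \cdot \frac{12}{23 \cdot 24} \, [13 \cdot (12 + 11) + (13 + 10) \cdot 11] \\
&= \frac{1}{25} \cdot \frac{12}{23 \cdot 24} \cdot 23 \cdot 24 = \frac{12}{25},
\end{aligned}$$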


which is, interestingly, the probability of drawing a green on the first draw. 
Could we have foreseen the outcome of this calculation, and saved ourselves 
some trouble? It is left to you to decide whether or not the probability of drawing
a green on draw number k, $1 \le k \le 25$, when we are drawing without replacement,
might depend on k.
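
Both parts of Example 1.3.1 can be verified by brute-force enumeration over color sequences. The helper name `seq_prob` is our own; it applies the multiply-along-the-draws reasoning of the solution to any sequence of colors. Only the third draw is checked here, so the question about draw number k stays open.

```python
from fractions import Fraction
from itertools import permutations, product

COUNTS = {"r": 5, "g": 12, "y": 8}   # the urn of Example 1.3.1

def seq_prob(seq, counts=COUNTS):
    """Probability of drawing the colors in seq, in order, without replacement."""
    left = dict(counts)
    total = sum(left.values())
    p = Fraction(1)
    for color in seq:
        p *= Fraction(left[color], total)   # chance of this color on this draw
        left[color] -= 1                    # the drawn ball is not replaced
        total -= 1
    return p

# (a) one red, one green, and one yellow, in any order
p_a = sum(seq_prob(s) for s in permutations("rgy"))
# (b) the third ball drawn is green, summed over all 27 color sequences
p_b = sum(seq_prob(s) for s in product("rgy", repeat=3) if s[2] == "g")

print(p_a)  # 24/115
print(p_b)  # 12/25
```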


1.3.2 A room contains two urns, A and B. A contains nine red balls and one 
green ball; B contains four red balls and four green balls. The room is dark- 
ened, a man stumbles into it, gropes about for an urn, draws two balls without 
replacement, and leaves the room. 

(a) What is the probability that both balls will be red? 


(b) Suppose that one ball is red and one is green: what is the probability that 
urn A now contains only eight balls? 
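Solutions. (a) Using obvious and self-explanatory abbreviations,

$$P(\text{“both red”}) = P(Arr) + P(Brr) = \frac{1}{2} \cdot \frac{9}{10} \cdot \frac{8}{9} + \frac{1}{2} \cdot \frac{4}{8} \cdot \frac{3}{7} = \frac{71}{140}.$$

(b) We calculate

$$P(\text{“Urn A was the urn chosen”} \mid \text{“One ball is red and one green”})
= \frac{P(\{Arg, Agr\})}{P(\{Arg, Agr, Brg, Bgr\})}
= \frac{\frac{1}{2} \cdot \frac{9}{10} \cdot \frac{1}{9} + \frac{1}{2} \cdot \frac{1}{10} \cdot \frac{9}{9}}{\frac{1}{2} \cdot \frac{9}{10} \cdot \frac{1}{9} + \frac{1}{2} \cdot \frac{1}{10} \cdot \frac{9}{9} + \frac{1}{2} \cdot \frac{4}{8} \cdot \frac{4}{7} + \frac{1}{2} \cdot \frac{4}{8} \cdot \frac{4}{7}}
= \frac{7}{27}.$$

Notice that this result is satisfyingly less than 1/2, the a priori probability that
urn A was chosen.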
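
The arithmetic in Example 1.3.2 can be checked with a short three-stage computation: choose an urn, then draw twice. The helper name `draw_two` is hypothetical, and the code assumes, as the solution does, that each urn is chosen with probability 1/2.

```python
from fractions import Fraction

URNS = {"A": {"r": 9, "g": 1}, "B": {"r": 4, "g": 4}}
P_URN = Fraction(1, 2)   # assumption used in the solution: each urn equally likely

def draw_two(urn, first, second):
    """P(urn chosen, then `first` drawn, then `second` drawn) without replacement."""
    left = dict(URNS[urn])
    total = sum(left.values())
    p_first = Fraction(left[first], total)
    left[first] -= 1
    p_second = Fraction(left[second], total - 1)
    return P_URN * p_first * p_second

# (a) both balls red
both_red = draw_two("A", "r", "r") + draw_two("B", "r", "r")

# (b) urn A was chosen, given one red and one green
mixed_A = draw_two("A", "r", "g") + draw_two("A", "g", "r")
mixed_B = draw_two("B", "r", "g") + draw_two("B", "g", "r")
posterior_A = mixed_A / (mixed_A + mixed_B)

print(both_red)      # 71/140
print(posterior_A)   # 7/27
```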


Exercises 1.3 


1. An urn contains six red balls, five green balls, and three yellow balls. Two 
are drawn without replacement. What is the probability that at least one is 
yellow? 


2. Same question as in 1, except that three balls are drawn without replacement.


3. Same question as in 1, except that the drawing is with replacement.


4. An actuary figures that for a plane of a certain type, there is a 1 in 100,000 
chance of a crash somewhere during a flight from New York to Chicago, 
and a 1 in 150,000 chance of a crash somewhere during a flight from
Chicago to Los Angeles. 


A plane of that type is to attempt to fly from New York to Chicago and then 
from Chicago to Los Angeles. 


(a) What is the probability of a crash somewhere along the way? [Please 
do not use your calculator to convert to a decimal approximation. ] 

(b) Suppose that you know that the plane crashed, but you know noth- 
ing else. What is the probability that the crash occurred during the 
Chicago-L.A. leg of the journey? 


5. Urn A contains 11 red and seven green balls, urn B contains four red and 
one green, and urn C contains two red and six green balls. The three urns 
are placed in a dark room. Someone stumbles into the room, gropes around, 
finds an urn, draws a ball from it, lurches from the room, and looks at the 
ball. It is green. What is the probability that it was drawn from urn A? 


In this experiment, what is the probability that a red ball will be chosen? 
What is the proportion of red balls in the room? 


6. Who was right, D’Alembert or Laplace? Or neither?


7. What is the probability of heads coming up exactly twice in three flips of a 
fair coin? 
8. Let $p_i$ and $q_{ij}$ be as in the text preceding these exercises. What is the value
of $\sum_{i=1}^{n} \sum_{j=1}^{m} p_i q_{ij}$?
9. On each of three different occasions, a fair coin is flipped twice. 


(a) On the first occasion, you witness the second flip, but not the first. You 
see that heads comes up. What is the probability that heads came up 
on both flips? 

(b) On the second occasion, you are not present, but are shown a video of 
one of the flips, you know not which; either is as likely as the other. 
On the video, heads comes up. What is the probability that heads came 
up on both flips? 


(c) On the third occasion, you are not present; a so-called friend teases
you with the following information, that heads came up at least once
in the two flips. What is the probability that heads came up on both
flips?


*10. For planes of a certain type, the actuarial estimate of the probability of a 
crash during a flight (including take-off and landing) from New York to 
Chicago is $p_1$; from Chicago to L.A., $p_2$.


In experiment #1, a plane of that type is trying to fly from New York to 
L.A., with a stop in Chicago. Let a denote the conditional probability that, 
if there is a crash, it occurs on the Chicago-L.A. leg of the journey. 


In experiment #2, two different planes of the fatal type are involved; one 
is to fly from N.Y. to Chicago, the other from Chicago to L.A. Let b de- 
note the conditional probability that the Chicago-L.A. plane crashed, if it is 
known that at least one of the two crashed, and let c denote the conditional 
probability that the Chicago-L.A. plane crashed, if it is known that exactly 
one of the two crashed. 

Express a, b, and c in terms of $p_1$ and $p_2$, and show that $a \le c \le b$ for all
possible $p_1$, $p_2$.




1.4 Independence 


Definition Suppose that (S, P) is a finite probability space, and $E_1, E_2 \subseteq S$.
The events $E_1$, $E_2$ are independent if and only if

\[ P(E_1 \cap E_2) = P(E_1)P(E_2). \]


1.4.1 Suppose that both $P(E_1)$ and $P(E_2)$ are non-zero. The following are
equivalent:

(a) $E_1$ and $E_2$ are independent;
(b) $P(E_1 \mid E_2) = P(E_1)$;
(c) $P(E_2 \mid E_1) = P(E_2)$.

Proof: Left to you. □


The intuitive meaning of independence should be fairly clear from the 
proposition; if the events have non-zero probability, then two events are inde- 
pendent if and only if the occurrence of either has no influence on the likelihood 
of the occurrence of the other. Besides saying that two events are independent, 
we will also say that one event is independent of another. 
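
The definition and the equivalent conditions of 1.4.1 can be tested mechanically on a small space. The following is only an illustrative sketch: the two-dice space and the event names are assumptions made for the example and do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite probability space: two rolls of a fair die,
# all 36 outcomes equally likely.
S = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) for an event given as a set of outcomes of S (equiprobable case)."""
    return Fraction(len(set(event)), len(S))

def independent(e1, e2):
    """True exactly when P(E1 and E2) = P(E1) P(E2)."""
    return prob(set(e1) & set(e2)) == prob(e1) * prob(e2)

def cond(e1, e2):
    """Conditional probability P(E1 | E2); requires P(E2) > 0."""
    return prob(set(e1) & set(e2)) / prob(e2)

first_is_six = {u for u in S if u[0] == 6}
second_is_even = {u for u in S if u[1] % 2 == 0}
total_is_twelve = {u for u in S if sum(u) == 12}

# Events described by different rolls: independent, and 1.4.1(b) holds.
print(independent(first_is_six, second_is_even))
print(cond(first_is_six, second_is_even) == prob(first_is_six))

# The total depends on the first roll, so independence fails here.
print(independent(first_is_six, total_is_twelve))
```

Exact rational arithmetic (`Fraction`) is used so that the equality test in the definition is not disturbed by floating-point rounding.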

We shall say that two stages, say the ith and jth, $i \neq j$, of a multi-stage
experiment, are independent if and only if for any outcome x at the ith stage, 
and any outcome y at the jth stage, the events “x occurred at the ith stage” and 
“y occurred at the jth stage” are independent. This means that no outcome at 
either stage will influence the outcome at the other stage. It is intuitively evident, 
and can be proven, that when two stages are independent, any two events whose 
descriptions involve only those stages, respectively, will be independent. See 
Exercises 2.2.5 and 2.2.6. 


Exercises 1.4 


1. Suppose that E and F are independent events in a finite probability space 
(S, P). Show that 


(a) E and S \ F are independent; 
(b) S\ E and F are independent; 
(c) S\ E and S \ F are independent. 


2. Show that each event with probability 0 or 1 is independent of every event 
in its space. 


3. Suppose that S = {a, b,c, d}, all outcomes equally likely, and E = {a, b}. 
List all the events in this space that are independent of E. 


4. Suppose that S = {a, b, c,d, e}, all outcomes equally likely, and E = {a, b}. 
List all the events in this space that are independent of E. 


5. Urn A contains 3 red balls and 1 green ball, and urn B contains no red balls
and 75 green balls. The action will be: select one of urns A, B, or C at 
random (meaning they are equally likely to be selected), and draw one ball 
from it. 


How many balls of each color should be in urn C, if the event “a green 
ball is selected” is to be independent of “urn C is chosen,” and urn C is to 
contain as few balls as possible? 


6. Is it possible for two events to be both independent and mutually exclusive? 
If so, in what circumstances does this happen? 


7. Suppose that S = {a, b, c, d}, and these outcomes are equally likely. Suppose
that E = {a, b}, F = {a, c}, and G = {b, c}. Verify that E, F, and G
are pairwise independent. Verify that

\[ P(E \cap F \cap G) \neq P(E)P(F)P(G). \]

Draw a moral by completing this sentence: just because $E_1, \ldots, E_k$ are
pairwise independent, it does not follow that ________. What if
$E_1, \ldots, E_k$ belong to the distinct, pairwise independent stages of a multi-stage
experiment?




1.5 Bernoulli trials 


Definition Suppose that $n \ge 0$ and $k$ are integers; $\binom{n}{k}$, read as “n-choose-k” or
“the binomial coefficient n, k,” is the number of different k-subsets of an n-set
(a set with n elements).


It is hoped that the reader is well-versed in the fundamentals of the binomial
coefficients $\binom{n}{k}$, and that the following few facts constitute a mere review.

Note that $\binom{n}{k} = 0$ for k > n and for k < 0, by the definition above, and that
$\binom{n}{0} = \binom{n}{n} = 1$ and $\binom{n}{1} = n$ for all non-negative integers n. Some of those facts
force us to adopt the convention that 0! = 1, in what follows and forever after.


1.5.1 For $0 \le k \le n$, $\binom{n}{k} = \dfrac{n!}{k!\,(n-k)!}$.

1.5.2 $\binom{n}{k} = \binom{n}{n-k}$.

1.5.3 $\binom{n}{k} + \binom{n}{k+1} = \binom{n+1}{k+1}$.

Definition An alphabet is just a non-empty finite set, and a word of length n
over an alphabet A is a sequence, of length n, of elements of A, written without
using parentheses and commas; for instance, 101 is a word of length 3 over
{0, 1}.

1.5.4 Suppose $\alpha$ and $\beta$ are distinct symbols. Then
$\binom{n}{k} = |\{w;\ w \text{ is a word of length } n \text{ over } \{\alpha, \beta\} \text{ and } \alpha \text{ appears exactly } k \text{ times in } w\}|$.

1.5.5 For any numbers $\alpha$, $\beta$,
\[ (\alpha+\beta)^n = \sum_{k=0}^{n} \binom{n}{k} \alpha^k \beta^{n-k}. \]

1.5.6 $2^n = \sum_{k=0}^{n} \binom{n}{k}$.
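
The facts 1.5.1 through 1.5.6 can be spot-checked numerically for small n. The sketch below is a sanity check under stated assumptions, not a proof: the helper `words_with_k` is a hypothetical name introduced here, the symbols "a" and "b" stand in for the two distinct symbols of 1.5.4, and 1.5.3 is taken in the form $\binom{n}{k} + \binom{n}{k+1} = \binom{n+1}{k+1}$.

```python
from itertools import product
from math import comb, factorial

def words_with_k(n, k, a="a", b="b"):
    """Count the words of length n over {a, b} in which a appears exactly k times."""
    return sum(1 for w in product((a, b), repeat=n) if w.count(a) == k)

for n in range(0, 9):
    for k in range(0, n + 1):
        assert comb(n, k) == factorial(n) // (factorial(k) * factorial(n - k))  # 1.5.1
        assert comb(n, k) == comb(n, n - k)                                      # 1.5.2
        assert comb(n, k) + comb(n, k + 1) == comb(n + 1, k + 1)                 # 1.5.3
        assert comb(n, k) == words_with_k(n, k)                                  # 1.5.4
    assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n                       # 1.5.6

# 1.5.5 for one arbitrary choice of integers alpha, beta and exponent n.
alpha, beta, n = 3, -5, 7
assert (alpha + beta) ** n == sum(
    comb(n, k) * alpha ** k * beta ** (n - k) for k in range(n + 1)
)
```

Note that `math.comb(n, k)` returns 0 when k > n, in agreement with the convention noted before 1.5.1.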

Of the propositions above, 1.5.1 and 1.5.4 are the most fundamental, in
that each of the others can be seen to follow from these two. However, 1.5.2
and 1.5.3 also have “combinatorial” proofs that appeal directly to the definition
of $\binom{n}{k}$.

Proposition 1.5.5 is the famous binomial theorem of Pascal; it is from the
role of the $\binom{n}{k}$ in this proposition that the term “binomial coefficient” arises.

Definitions A Bernoulli trial is an experiment with exactly two possible outcomes.
A sequence of independent Bernoulli trials is a multi-stage experiment
in which the stages are the same Bernoulli trial, and the stages are independent.


Tossing a coin is an example of a Bernoulli trial. So is drawing a ball 
from an urn, if we are distinguishing between only two types of balls. If the 
drawing is with replacement (and, it is understood, with mixing of the balls 
after replacement), a sequence of such drawings from an urn is a sequence of 
independent Bernoulli trials. 

When speaking of some unspecified Bernoulli trial, we will call one pos- 
sible outcome Success, or S, and the other Failure, or F. The distinction is 
arbitrary. For instance, in the Bernoulli trial consisting of a commercial DC-10 
flight from New York to Chicago, you can let the outcome “plane crashes” cor- 
respond to the word Success in the theory, and “plane doesn’t crash” to Failure, 
or the other way around. 

Another way of saying that a sequence of Bernoulli trials (of the same type) 
is independent is: the probability of Success does not vary from trial to trial. 
Notice that if the probability of Success is p, then the probability of Failure is
$1 - p$.


1.5.7 Theorem Suppose the probability of Success in a particular Bernoulli
trial is p. Then the probability of exactly k successes in a sequence of n independent
such trials is

\[ \binom{n}{k} p^k (1-p)^{n-k}. \]


Proof: Let the set of outcomes of the experiment consisting of the sequence of
n independent Bernoulli trials be identified with $\{S, F\}^n$, the set of all sequences
of length n of the symbols S and F. If u is such a sequence in which S appears
exactly k times, then, by what we know about assigning probabilities to the
outcomes of multi-stage experiments, we have

\[ P(u) = p^k (1-p)^{n-k}. \]

Therefore,

\begin{align*}
P(\text{“exactly } k \text{ successes”}) &= \sum_{\substack{S \text{ occurs exactly} \\ k \text{ times in } u}} P(u) \\
&= |\{u;\ S \text{ occurs exactly } k \text{ times in } u\}| \, p^k (1-p)^{n-k} \\
&= \binom{n}{k} p^k (1-p)^{n-k},
\end{align*}

by 1.5.4.


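In case k < 0 or k > n, when there are no such u, the truth of the theorem
follows from the fact that $\binom{n}{k} = 0$. □

Example Suppose an urn contains 7 red and 10 green balls, and 20 balls are
drawn with replacement (and mixing) after each draw. What is the probability
that (a) exactly 4, or (b) at least 4, of the balls drawn will be red?

Answers:

(a)
\[ \binom{20}{4} \left(\frac{7}{17}\right)^{4} \left(\frac{10}{17}\right)^{16} = \frac{20 \cdot 19 \cdot 18 \cdot 17}{4 \cdot 3 \cdot 2} \left(\frac{7}{17}\right)^{4} \left(\frac{10}{17}\right)^{16}. \]

(b)
\begin{align*}
\sum_{k=4}^{20} \binom{20}{k} \left(\frac{7}{17}\right)^{k} \left(\frac{10}{17}\right)^{20-k}
&= 1 - \sum_{k=0}^{3} \binom{20}{k} \left(\frac{7}{17}\right)^{k} \left(\frac{10}{17}\right)^{20-k} \\
&= 1 - \left[ \left(\frac{10}{17}\right)^{20} + 20 \left(\frac{7}{17}\right) \left(\frac{10}{17}\right)^{19} + \frac{20 \cdot 19}{2} \left(\frac{7}{17}\right)^{2} \left(\frac{10}{17}\right)^{18} \right. \\
&\qquad\quad \left. {} + \frac{20 \cdot 19 \cdot 18}{3 \cdot 2} \left(\frac{7}{17}\right)^{3} \left(\frac{10}{17}\right)^{17} \right].
\end{align*}

Observe that, in (b), the second expression for the probability is much more
economical and evaluable than the first.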
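
Theorem 1.5.7 and the urn example can be checked with exact rational arithmetic. The sketch below is illustrative only; the function names `binom_pmf` and `brute_force` are introduced here and are not the text's. The brute-force version follows the proof: it adds up P(u) over all sequences u in {S, F}^n containing exactly k S's, which is feasible only for small n.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials (Theorem 1.5.7)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k) if 0 <= k <= n else Fraction(0)

def brute_force(n, k, p):
    """Sum P(u) over every sequence u of S's and F's with exactly k S's."""
    total = Fraction(0)
    for u in product("SF", repeat=n):
        if u.count("S") == k:
            prob_u = Fraction(1)
            for symbol in u:            # multiply stage by stage
                prob_u *= p if symbol == "S" else 1 - p
            total += prob_u
    return total

p = Fraction(7, 17)                     # probability of red on each draw

# The formula agrees with direct enumeration for a small number of trials.
assert all(binom_pmf(6, k, p) == brute_force(6, k, p) for k in range(7))

# The example: 20 draws with replacement.
exactly_4 = binom_pmf(20, 4, p)
at_least_4_direct = sum(binom_pmf(20, k, p) for k in range(4, 21))     # 17 terms
at_least_4_complement = 1 - sum(binom_pmf(20, k, p) for k in range(4))  # 4 terms
assert at_least_4_direct == at_least_4_complement

print(float(exactly_4), float(at_least_4_complement))
```

The two expressions for (b) are equal as exact fractions; the complement form simply needs four terms instead of seventeen.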


Exercises 1.5 
1. An urn contains five red, seven green, and three yellow balls. Nine are 
drawn, with replacement. Find the probability that 


(a) exactly six of the balls drawn are green; 


(b) at least two of the balls drawn are yellow; 
(c) at most four of the balls drawn are red. 


2. In eight tosses of a fair coin, find the probability that heads will come up 


(a) exactly three times; 
(b) at least three times; 
(c) at most three times. 
3. Show that the probability of heads coming up exactly n times in 2n flips of 
a fair coin decreases with n. 


*4. Find a simple representation of the polynomial $\sum_{k=0}^{n} \binom{n}{k} x^k (1-x)^{n-k}$.


1.6 An elementary counting principle 


1.6.1 Suppose that, in a k-stage experiment, for each i, $1 \le i \le k$, whatever
may have happened in stages preceding, there are exactly $n_i$ outcomes possible
at the ith stage. Then there are $\prod_{i=1}^{k} n_i$ possible sequences of outcomes in the k
stages.

The proof can be done by induction on k. The word “preceding” in the
statement above may seem to some to be too imprecise, and to others to be too 
precise, introducing an assumption about the order of the stages that need not be 
introduced. I shall leave the statement as it is, with “preceding” to be construed, 
in applications, as the applier deems wise. 

The idea behind the wording is that the possible outcomes at the different 
stages may depend on the outcomes at other stages (those “preceding’’), but 
whatever has happened at those other stages upon which the list of possible 
outcomes at the ith stage depends, the list of possible outcomes at the ith stage 
will always be of the same length, $n_i$.

For instance, with k = 2, suppose stage one is flipping a coin, and stage two 
is drawing a ball from one of two urns, RWB, which contains only red, white, 
and blue balls, and BGY, which contains only blue, green, and yellow balls. 
If the outcome at stage 1 is H, the ball is drawn from RWB; otherwise, from
BGY. In this case, there are five possible outcomes at stage 2; yet $n_2 = 3$, and
there are 2-3 = 6 possible sequences of outcomes of the experiment, namely 
HR, HW, HB, TB, TG, and TY, in abbreviated form. If, say, the second 
urn contained balls of only two different colors, then this two-stage experiment 
would not satisfy the hypothesis of 1.6.1. 

In applying this counting principle, you think of a way to make, or con- 
struct, the objects you are trying to count. If you are lucky and clever, you will 
come up with a k-stage construction process satisfying the hypothesis of 1.6.1, 
and each object you are trying to count will result from exactly one sequence 
of outcomes or choices in the construction process. [But beware of situations in 
which the objects you want to count each arise from more than one construction 
sequence. See Exercise 4, below. ] 

For instance, to see that $|A_1 \times \cdots \times A_k| = \prod_{i=1}^{k} |A_i|$, when $A_1, \ldots, A_k$ are
sets, you think of making sequences $(a_1, \ldots, a_k)$, with $a_i \in A_i$, $i = 1, \ldots, k$,
by the obvious process of first choosing $a_1$ from $A_1$, then $a_2$ from $A_2$, etc.
For another instance, the number of different five-card hands dealable from a
standard 52-card deck that are full houses is $13\binom{4}{3}\,12\binom{4}{2}$. [Why?]
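
The coin-and-urn illustration of 1.6.1 is small enough to enumerate outright. The following sketch does so, with the colors abbreviated by their initials; the dictionary `urn_for` and the other variable names are assumptions of this example. The last lines evaluate the full-house product, with comments giving one possible four-stage reading of it.

```python
from math import comb, prod

# Stage 1: flip a coin. Stage 2: draw from the urn that the coin selects.
urn_for = {
    "H": ["R", "W", "B"],   # urn RWB
    "T": ["B", "G", "Y"],   # urn BGY
}

sequences = [coin + ball for coin in urn_for for ball in urn_for[coin]]
stage_two_outcomes = {ball for balls in urn_for.values() for ball in balls}

n1, n2 = 2, 3
assert len(sequences) == prod([n1, n2]) == 6
assert len(stage_two_outcomes) == 5      # five colors in all, yet n_2 = 3

# One reading of the full-house count: pick the rank of the triple (13 ways),
# three of its four suits, then a different rank for the pair (12 ways),
# and two of its four suits.
full_houses = 13 * comb(4, 3) * 12 * comb(4, 2)

print(sequences)
print(full_houses)
```

Each full house arises from exactly one run of the four stages, which is what licenses multiplying the stage counts.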


Exercises 1.6 

1. In the example above involving an urn, if the first urn contains balls of three 
different colors, and the second contains balls of two different colors, how 
many different possible sequences of outcomes are there in the two stages 
of the experiment? 

2. How many different words of length $\ell$ are there over an alphabet with n
letters? What does this have to do with 1.6.1? 

3. In a certain collection of 23 balls, 15 are red and 8 are blue. How many 
different 7-subsets of the 23 balls are there, with 5 red and 2 blue balls? 


4. How many different five-card hands dealable from a standard deck are
“two-pair” hands?

1.7* On drawing without replacement


Suppose that an urn contains x red balls and y green balls. We draw n without 
replacement. What is the probability that exactly k will be red? 

The set of outcomes of interest is identifiable with the set of all sequences, 
of length n, of the symbols r and g. We are interested in the event consisting of 
all such sequences in which r appears exactly k times. 

We are not in the happy circumstances of Section 1.5, in which the prob- 
ability of a red would be unchanging from draw to draw, but we do enjoy one 
piece of good fortune similar to something in that section: the different outcomes
in the event “exactly k reds” all have the same probability. (Verify!) For
$0 < k < n \le x+y$, that probability is

\[ \frac{x(x-1)\cdots(x-k+1)\; y(y-1)\cdots(y-(n-k)+1)}{(x+y)(x+y-1)\cdots(x+y-n+1)} = \frac{x!\, y!\, (x+y-n)!}{(x-k)!\,(y-(n-k))!\,(x+y)!}; \]

this last expression is only valid when $k \le x$ and $n-k \le y$, which are necessary
conditions for the probability to be non-zero. Under those conditions, the last
expression is valid for k = 0 and k = n, as well.
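By the same reasoning as in the proof of the independent Bernoulli trial
theorem, 1.5.7, invoking 1.5.4, we have that

\begin{align*}
P(\text{“exactly } k \text{ reds”}) &= \binom{n}{k} \frac{x(x-1)\cdots(x-k+1)\; y(y-1)\cdots(y-(n-k)+1)}{(x+y)(x+y-1)\cdots(x+y-n+1)} \\
&= \binom{n}{k} \frac{x!\, y!\, (x+y-n)!}{(x-k)!\,(y-(n-k))!\,(x+y)!}
\end{align*}

for $0 < k < n$, provided we understand this last expression to be zero when
$k > x$ or when $n - k > y$.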

There is another way of looking at this experiment. Instead of drawing n 
balls one after the other, suppose you just reach into the urn and scoop up n 
balls all at once. Is this really different from drawing the balls one at a time, 
when you look at the final result? Supposing the two to be the same, we can 
take as the set of outcomes all n-subsets of the x + y balls in the urn. Surely 
no n-subset is more likely than any other to be scooped up, so the outcomes 
are equiprobable, each with probability $1/\binom{x+y}{n}$ (provided $n \le x+y$). How
many of these outcomes are in the event “exactly k reds are among the n balls
selected?” Here is a two-stage method for forming such outcomes: first take
a k-subset of the x red balls in the urn, then an $(n-k)$-subset of the y green
balls (and put them together to make an n-set). Observe that different outcomes
at either stage result in different n-sets, and that every n-set of these balls with
exactly k reds is the result of one run of the two stages. By 1.6.1, it follows that

\[ |\text{“exactly } k \text{ reds”}| = \binom{x}{k}\binom{y}{n-k}. \]

Thus $P(\text{“exactly } k \text{ reds”}) = \binom{x}{k}\binom{y}{n-k} \big/ \binom{x+y}{n}$, provided $n \le x+y$.

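
The two ways of looking at drawing without replacement can be compared numerically before attempting an algebraic verification. This is a sketch under assumptions: the urn sizes are arbitrary, and the function names `falling`, `p_sequential`, and `p_subsets` are hypothetical labels for the two views.

```python
from fractions import Fraction
from math import comb

def falling(m, j):
    """The product m (m-1) ... (m-j+1) of j factors; equal to 1 when j = 0."""
    result = 1
    for i in range(j):
        result *= m - i
    return result

def p_sequential(x, y, n, k):
    """View 1: C(n, k) times the common probability of each sequence with k reds."""
    return Fraction(comb(n, k) * falling(x, k) * falling(y, n - k), falling(x + y, n))

def p_subsets(x, y, n, k):
    """View 2: n-subsets with exactly k reds, out of all n-subsets of the balls."""
    return Fraction(comb(x, k) * comb(y, n - k), comb(x + y, n))

x, y = 8, 11                     # arbitrary urn: 8 red, 11 green
for n in range(1, x + y + 1):
    for k in range(0, n + 1):
        assert p_sequential(x, y, n, k) == p_subsets(x, y, n, k)
    # The probabilities of the events "exactly k reds" add up to 1.
    assert sum(p_subsets(x, y, n, k) for k in range(n + 1)) == 1
```

When k > x or n - k > y, a zero factor appears in the falling product and `math.comb` returns 0, so both views give probability zero, matching the convention stated in the text.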
Exercises 1.7 
1. Verify that the two different ways of looking at drawing without replace- 
ment give the same probability for the event “exactly k balls are red.” 


2. Suppose an urn contains 10 red, 13 green, and 4 yellow balls. Nine are 
drawn, without replacement. 


(a) What is the probability that exactly three of the nine balls drawn will 
be yellow? 

(b) Find the probability that there are four red, four green, and one yellow 
among the nine balls drawn. 


3. Find the probability of being dealt a flush (all cards of the same suit) in a 
five-card poker game. 


1.8 Random variables and expected, or average, value 


Suppose that (S, P) is a finite probability space. A random variable on this 
space is a function from S into the real numbers, $\mathbb{R}$. If $X : S \to \mathbb{R}$ is a random
variable on (S, P), the expected or average value of X is 


\[ E(X) = \sum_{u \in S} X(u)P(u). \]

Random variables are commonly denoted X, Y, Z, or with subscripts: $X_1$,
$X_2, \ldots$. It is sometimes useful, as in 1.8.2, to think of a random variable as
a measurement on the outcomes, and the average value as the average of the
measurements. The average value of a random variable X is sometimes denoted
$\bar{X}$; you may recall that a bar over a letter connotes the arithmetic average, in
elementary statistics.

The word “expected” in “expected value” requires interpretation. As nu- 
merous examples show (see 1.8.1, below), a random variable which takes only 
integer values can have a non-integer expected value. The point is that the ex- 
pected value is not necessarily a possible value of the random variable, and thus 
is not really necessarily to be “expected” in any running of the experiment to 
which (S, P) is associated.

Examples


1.8.1 The experiment consists of n independent Bernoulli trials, with probability
p of success on each trial. Let X = “number of successes.” Finding E(X) in
these circumstances is one of our goals; for now we’ll have to be content with a
special case.

Sub-example: n = 3, p = 1/2; let the experiment be flipping a fair coin 
three times, and let “success” be “heads.” With X = “number of heads,” we 
have 

\begin{align*}
E(X) &= X(HHH)P(HHH) + X(HHT)P(HHT) + \cdots + X(TTT)P(TTT) \\
&= \frac{1}{8}\,[3+2+2+1+2+1+1+0] = \frac{12}{8} = \frac{3}{2}.
\end{align*}

1.8.2 We have some finite population (voters in a city, chickens on a farm) 
and some measurement M that can be applied to each individual of the pop- 
ulation (blood pressure, weight, hat size, length of femur, ...). Once units 
are decided upon, M(s) is a pure real number for each individual s. We can 
regard M as a random variable on the probability space associated with the 
experiment of choosing a member of the population “at random.” Here S is 
the population, the outcomes are equiprobable (so P is the constant assignment 
1/|S|), and M assigns to each outcome (individual of the population) whatever
real number measures that individual’s M-measure. If $S = \{s_1, \ldots, s_n\}$,
then $E(M) = (M(s_1) + \cdots + M(s_n))/n$, the arithmetic average, or mean, of the
measurements $M(s)$, $s \in S$.


1.8.3 Let an urn contain 8 red and 11 green balls. Four are drawn without 
replacement. Let X be the number of green balls drawn. What is E(X)? 

Note that we have at our disposal two different views of this experiment. 
In one view, the set of outcomes is the set of ordered quadruples of the symbols 
r and g; i.e., S = {rrrr, rrrg, ..., gggg}. In the other view, an outcome is a
four-subset of the 19 balls in the urn. 

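Taking the first view, we have

\begin{align*}
E(X) = {} & 0 \cdot \frac{8 \cdot 7 \cdot 6 \cdot 5}{19 \cdot 18 \cdot 17 \cdot 16} + 1 \cdot 4 \cdot \frac{8 \cdot 7 \cdot 6 \cdot 11}{19 \cdot 18 \cdot 17 \cdot 16} \\
& + 2 \cdot \binom{4}{2} \frac{8 \cdot 7 \cdot 11 \cdot 10}{19 \cdot 18 \cdot 17 \cdot 16} + 3 \cdot 4 \cdot \frac{8 \cdot 11 \cdot 10 \cdot 9}{19 \cdot 18 \cdot 17 \cdot 16} \\
& + 4 \cdot \frac{11 \cdot 10 \cdot 9 \cdot 8}{19 \cdot 18 \cdot 17 \cdot 16}.
\end{align*}

Verify: $E(X) = 4 \cdot \frac{11}{19}$. Verify: the same answer is obtained if the other view of
this experiment is taken. [Use 1.8.4, below, to express E(X).]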
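
Both verifications requested in 1.8.3 can be carried out by machine. The sketch below is a numerical check, not a derivation, and the names `word_prob`, `e_view1`, and `e_view2` are introduced here for illustration. View 1 sums X(u)P(u) over the sixteen words in r and g, assigning each word its multi-stage probability; view 2 uses the subset count for P(X = k) and then sums k P(X = k).

```python
from fractions import Fraction
from itertools import product
from math import comb

def word_prob(word, reds=8, greens=11):
    """Probability of drawing the colors in `word`, in order, without replacement."""
    p = Fraction(1)
    for symbol in word:
        total = reds + greens
        if symbol == "r":
            p *= Fraction(reds, total)
            reds -= 1
        else:
            p *= Fraction(greens, total)
            greens -= 1
    return p

# View 1: outcomes are words of length 4 over {r, g}; X counts the g's.
e_view1 = sum(w.count("g") * word_prob(w) for w in product("rg", repeat=4))

# View 2: outcomes are 4-subsets of the 19 balls, all equally likely.
e_view2 = sum(
    k * Fraction(comb(11, k) * comb(8, 4 - k), comb(19, 4)) for k in range(5)
)

assert e_view1 == e_view2 == Fraction(4 * 11, 19)
print(e_view1)
```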


1.8.4 Theorem Suppose X is a random variable on (S, P). Then
$E(X) = \sum_{x \in \mathbb{R}} x\,P(X = x)$.

Here, “X = x” is the event $\{u \in S;\ X(u) = x\} = X^{-1}(\{x\})$. Note that
$X^{-1}(\{x\}) = \emptyset$ unless $x \in \operatorname{ran}(X)$. Since S is finite, so is $\operatorname{ran}(X)$. If
$\operatorname{ran}(X) \subseteq \{x_1, \ldots, x_r\}$, and $x_1, \ldots, x_r$ are distinct, then

\[ \sum_{x \in \mathbb{R}} x\,P(X = x) = \sum_{k=1}^{r} x_k\,P(X = x_k). \]


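Proof: Let $x_1, \ldots, x_r$ be the possible values that X might take; i.e.,
$\operatorname{ran}(X) \subseteq \{x_1, \ldots, x_r\}$. Let, for $k \in \{1, \ldots, r\}$, $E_k$ = “$X = x_k$” $= X^{-1}(\{x_k\})$.
Then $E_1, \ldots, E_r$ partition S. Therefore,

\begin{align*}
E(X) = \sum_{u \in S} X(u)P(u) &= \sum_{k=1}^{r} \sum_{u \in E_k} X(u)P(u) \\
&= \sum_{k=1}^{r} x_k \sum_{u \in E_k} P(u) = \sum_{k=1}^{r} x_k P(E_k). \qquad \Box
\end{align*}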


1.8.5 Corollary The expected or average number of successes, in a run of n
independent Bernoulli trials with probability p of success in each, is

\[ \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k}. \]

Proof: The possible values that X = “number of successes” might take are 0, 1,
..., n, and $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, for $k \in \{0, 1, \ldots, n\}$, by 1.5.7. □


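1.8.6 Theorem If $X_1, \ldots, X_n$ are random variables on (S, P), and $a_1, \ldots, a_n$
are real numbers, then $E\left(\sum_{k=1}^{n} a_k X_k\right) = \sum_{k=1}^{n} a_k E(X_k)$.

Proof:

\begin{align*}
E\left(\sum_{k=1}^{n} a_k X_k\right) &= \sum_{u \in S} \left(\sum_{k=1}^{n} a_k X_k\right)(u)\,P(u) \\
&= \sum_{u \in S} \sum_{k=1}^{n} a_k X_k(u)\,P(u) \\
&= \sum_{k=1}^{n} a_k \sum_{u \in S} X_k(u)\,P(u) = \sum_{k=1}^{n} a_k E(X_k). \qquad \Box
\end{align*}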


1.8.7 Corollary The expected, or average, number of successes, in a run of n 
independent Bernoulli trials with probability p of success in each, is np. 


Proof: First, let n = 1. Let X be the “number of successes.” By definition and 
by the hypothesis, 


\[ E(X) = 0 \cdot (1-p) + 1 \cdot p = p. \]

Now suppose that n > 1. Let X = “number of successes.” For $k \in \{1, \ldots, n\}$,
let $X_k$ = “number of successes on the kth trial.” By the case already done,
we have that $E(X_k) = p$, $k = 1, \ldots, n$. Therefore,

\[ E(X) = E\left(\sum_{k=1}^{n} X_k\right) = \sum_{k=1}^{n} E(X_k) = \sum_{k=1}^{n} p = np. \qquad \Box \]
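
The three descriptions of the expected number of successes (the definition, the sum in 1.8.5, and the value np in 1.8.7) can be compared for small n. The sketch below is only a numerical check under assumptions: the probability 3/10 is arbitrary, and the function names are introduced for this illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def expected_by_definition(n, p):
    """E(X) as the sum of X(u) P(u) over all outcomes u in {S, F}^n."""
    total = Fraction(0)
    for u in product("SF", repeat=n):
        k = u.count("S")                      # X(u), the number of successes
        total += k * p ** k * (1 - p) ** (n - k)
    return total

def expected_by_values(n, p):
    """E(X) as the sum over k of k P(X = k), the expression in 1.8.5."""
    return sum(k * comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(1, n + 1))

p = Fraction(3, 10)                           # arbitrary success probability
for n in range(1, 9):
    assert expected_by_definition(n, p) == expected_by_values(n, p) == n * p
```

The enumeration over {S, F}^n grows as 2^n, so the first function is useful only as a check on small cases; the point of 1.8.6 and 1.8.7 is that no such enumeration is needed.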


1.8.8 Corollary For $0 \le p \le 1$, and any positive integer n,

\[ \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = np. \]


Proof: This follows from 1.8.5 and 1.8.7, provided there exists, for each $p \in
[0, 1]$, a Bernoulli trial with probability p of success. You are invited to ponder
the existence of such trials (problem 1 at the end of this section).

Alternatively, using a sharper argument based on an elementary theorem
about polynomials, the conclusion here (and the result called for in problem
2 at the end of this section) follows from 1.8.5 and 1.8.7 and the existence of
such trials for only n + 1 distinct values of p. Therefore, the result follows
from the existence of such trials for rational numbers p between 0 and 1. If
p = s/b, where s and b are positive integers and s < b, let the Bernoulli trial
with probability p of Success be: draw a ball from an urn containing s Success
balls and b - s Failure balls. □


1.8.9 Corollary Suppose k balls are drawn, without replacement, from an urn 
containing x red balls and y green balls, where $1 \le k \le x+y$. The expected
number of red balls to be drawn is $kx/(x+y)$.


Proof: It is left to the reader to see that the probability of a red being drawn on
the ith draw, $1 \le i \le k$, is the same as the probability of a red being drawn on
the first draw, namely x/(x + y). [Before the drawing starts, the various x + y
balls are equally likely to be drawn on the ith draw, and x of them are red.]
Once this is agreed to, the proof follows the lines of that of 1.8.7. □


Exercises 1.8 


1. Suppose 0 < p < 1. Describe a Bernoulli trial with probability p of Suc- 
cess. You may suppose that there is a way of “picking a number at random” 
from a given interval. 


2. Suppose that n is a positive integer. Find a simple representation of the
polynomial $\sum_{k=1}^{n} k \binom{n}{k} x^k (1-x)^{n-k}$.


*3. State a result that is to 1.8.9 as 1.8.8 is to 1.8.7. 


4. An urn contains 4 red and 13 green balls. Five are drawn. What is the 
expected number of reds to be drawn if the drawing is 


(a) with replacement?
(b) without replacement?


5. For any radioactive substance, if the material is not tightly packed, it is 
thought that whether or not any one atom decays is independent of whether 
or not any other atom decays. [We are all aware, I hope, that this assump- 
tion fails, in a big way, when the atoms are tightly packed together. ] 


For a certain mystery substance, let p denote the probability that any par- 
ticular atom of the substance will decay during any particular 6-hour period 
at the start of which the atom is undecayed. A handful of this substance is 
left in a laboratory, in spread-out, unpacked condition, for 24 hours. At the 
end of the 24 hours it is found that approximately 1/10 of the substance 
has decayed. Find p, approximately. [Hint: let n be the number of atoms, 
and think of the process of leaving the substance for 24 hours as n inde- 
pendent Bernoulli trials. Note that p here is not the probability of Success, 
whichever of two possibilities you choose to be Success, but that probabil- 
ity is expressible in terms of p. Assume that the amount of substance that 
decayed was approximately the “expected” amount. ] 


6. An actuary reckons that for any given year that you start alive, you have a
1 in 6,000 chance of dying during that year.

You are going to buy $100,000 worth of five-year term life insurance. However
you pay for the insurance, let us define the payment to be fair if the
life insurance company’s expected gain from the transaction is zero.


(a) In payment plan number one, you pay a single premium at the begin- 
ning of the five-year-coverage period. Assuming the actuary’s estimate 
is correct, what is the fair value of this premium? 

(b) In payment plan number two, you pay five equal premiums, one at the 
beginning of each year of the coverage, provided you are alive to make 
the payment. Assuming the actuary’s estimate is correct, what is the 
fair value of this premium? 


7. (a) Find the expected total showing on the upper faces of two fair dice 
after a throw. 


(b) Same question as in (a), except use backgammon scoring, in which 
doubles count quadruple. For instance, double threes count 12. 



1.9 The Law of Large Numbers

1.9.1 Theorem Suppose that, for a certain Bernoulli trial, the probability of
Success is p. For a sequence of n independent such trials, let $X_n$ be the random
variable “number of Successes in the n trials.” Suppose $\varepsilon > 0$. Then

\[ P\left(\left|\frac{X_n}{n} - p\right| < \varepsilon\right) \to 1 \text{ as } n \to \infty. \]

For instance, if you toss a fair coin 10 times, the probability that
$\left|\frac{X_{10}}{10} - \frac{1}{2}\right| < \frac{1}{100}$, i.e., that the number of heads, say, is between 4.9 and 5.1, is just the
probability that heads came up exactly 5 times, which is $\binom{10}{5}\frac{1}{2^{10}} = \frac{63}{256}$, not very
close to 1. Suppose the same coin is tossed a thousand times; the probability that
$\left|\frac{X_{1000}}{1000} - \frac{1}{2}\right| < \frac{1}{100}$ is the probability that between 490 and 510 heads appear in
the thousand flips. By approximation methods that we will not go into here, this 
probability can be shown to be approximately 0.45. The probability of heads 
coming up between 4900 and 5100 times in 10,000 flips is around 0.95. 

For a large number of independent Bernoulli trials of the same species, it is 
plausible that the proportion of Successes “ought” to be near p, the probability 
of Success on each trial. The Law of Large Numbers, 1.9.1, gives precise form 
to this plausibility. This theorem has a purely mathematical proof that will not 
be given here. 

Theorem 1.9.1 is not the only way of stating that the proportion of Suc- 
cesses will tend, with high probability, to be around p, the probability of Suc- 
cess on each trial. Indeed, Feller [18] maligns this theorem as the weakest and 
least interesting of the various laws of large numbers available. Still, 1.9.1 is 
the best-known law of large numbers, and ’twill serve our purpose. 
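
For a fair coin, the probabilities quoted above can be computed exactly rather than by approximation, since the number of equally likely toss sequences is $2^n$. The sketch below assumes p = 1/2 throughout and reads “between” strictly, as the inequality in 1.9.1 requires; the function name `prob_within` and its integer parameters are choices made for this illustration.

```python
from math import comb

def prob_within(n, eps_num, eps_den):
    """P(|X_n/n - 1/2| < eps) for n tosses of a fair coin, with eps = eps_num/eps_den.

    Only integer comparisons are used:
    |k/n - 1/2| < eps  is equivalent to  |2k - n| * eps_den < 2 * n * eps_num.
    """
    favorable = sum(
        comb(n, k)
        for k in range(n + 1)
        if abs(2 * k - n) * eps_den < 2 * n * eps_num
    )
    return favorable / 2 ** n      # Python divides arbitrarily large integers correctly

for n in (10, 1000, 10000):
    print(n, prob_within(n, 1, 100))
```

With $\varepsilon = 1/100$ fixed, the printed probabilities increase toward 1 as n grows, which is the behavior Theorem 1.9.1 asserts in the limit.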


Exercises 1.9 

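*1. With n, $\varepsilon$, p, and $X_n$ as in 1.9.1, express $P\left(\left|\frac{X_n}{n} - p\right| < \varepsilon\right)$ explicitly as a
sum of terms of the form $\binom{n}{k} p^k (1-p)^{n-k}$. You will need the symbols $\lceil \cdot \rceil$
and $\lfloor \cdot \rfloor$, which stand for “round up” and “round down,” respectively.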


2. An urn contains three red and seven green balls. Twenty are drawn, with 
replacement. What is the probability of exactly six reds being drawn? Of 
five, six, or seven reds being drawn? 


3. An urn contains an unknown number of red balls, and 10 balls total. You 
draw 100 balls, with replacement; 42 are red. What is your best guess as to 
the number of red balls in the urn? 


*4. A pollster asks some question of 100 people, and finds that 42 of the 100 
give “favorable” responses. The pollster estimates from this result that 
(probably) between 40 and 45 percent of the total population would be 
“favorable” on this question, if asked. 


Ponder the similarities and differences between this inference and that in 
problem 3. Which is more closely analogous to the inference you used in 
doing Exercise 1.8.5?


Information and Entropy


2.1 How is information quantified? 


Information theory leaped fully clothed from the forehead of Claude Shannon 
in 1948 [63]. The foundation of the theory is a quantification of information, 
a quantification that a few researchers had been floundering toward for 20 or 
30 years (see [29] and [56]). The definition will appear strange and unnatural 
at first glance. The purpose of this first section of Chapter 2 is to acquaint the 
reader with certain issues regarding this definition, and finally to present the 
brilliant proof of its inevitability due, as far as we know, to Aczél and Daroczy 
in 1975 [1]. 

To begin to make sense of what follows, think of the familiar quantities area 
and volume. These are associated with certain kinds of objects—planar regions 
or regions on surfaces, in the case of area; bodies in space, in the case of volume. 
The assignment of these quantities to appropriate objects is defined, and the 
definitions can be quite involved; in fact, the final chapter on the question of 
mere definitions of area and volume was perhaps not written until the twentieth 
century, with the introduction of Lebesgue measure. 

These definitions are not simply masterpieces of arbitrary mathematical 
cleverness—they have to respond to certain human agreements on the nature of 
these quantities. The more elaborate definitions have to agree with simpler ways 
of computing area on simple geometric figures, and a planar region composed 
of two other non-overlapping planar regions should have area equal to the sum 
of the areas of the two. 

The class of objects to which the quantity information will be attached are 
occurrences of events associated with probabilistic experiments; another name 
for this class is random phenomena. It is supposed¹ that every such event or
phenomenon E has a pre-assigned, a priori probability P(E) of occurrence.
Here is Shannon’s definition of the “self-information” I(E) of an event E:

\[ I(E) = \log \frac{1}{P(E)} = -\log P(E). \]

If $P(E) = 0$, $I(E) = \infty$.

¹Take care! Paradoxes and absurdities are known to be obtainable by loose manipulation of
this assumption. These are avoidable by staying within the strict framework of a well-specified
probabilistic experiment, in each situation.

Feinstein [17] used other terminology that many would find more explanatory
than “self-information”: I(E) is the amount of information disclosed or
given off by the occurrence of E. This terminology coaxes our agreement to the 
premise that information ought to be a quantity attached to random phenomena 
with prior probabilities. And if that is agreed to, then it seems unavoidable that 
the quantity of information must be some function of the prior probability, i.e., 
I(E) = f(P(E)) for some function f, just because prior probability is the only
quantity associated with all random phenomena, the only thing to work with. 

Lest this seem a frightful simplification of the wild world of random phe- 
nomena, to compute information content as a function of prior probability alone, 
let us observe that this sort of simplification happens with other quantities; in- 
deed, such simplification is one of the charms of quantification. Planar regions 
of radically different shapes and with very different topological properties can 
have the same area. Just so; why shouldn’t a stock market crash in Tokyo and 
an Ebola virus outbreak in the Sudan possibly release the same amount of in- 
formation? The quantification of information should take no account of the 
“quality” or category of the random phenomenon whose occurrence releases 
the information. 

Suppose we agree that I(E) ought to equal f(P(E)) for some function f
defined on (0, 1], at least, for all probabilistic events E. Then why did Shannon 
take f(x) = log(1/x)? (And which log are we talking about? But we will 
deal with that question in the next subsection.) We will take up the question 
of Shannon’s inspiration, and Aczél and Daroczy’s final word on the matter, in 
Section 2.1.3. But to get acclimated, let’s notice some properties of f(x) = 
log(1/x), with log to any base > 1: f is a decreasing, non-negative function on 
(0, 1], and f(1) = 0. These seem to be necessary properties for a function to be 
used to quantify information, via the equation /(E) = f(P(E)). Since [(E) 
is to be a quantity, it should be non-negative. The smaller the prior probability 
of an event the greater the quantity of information released when it occurs, so 
f should be decreasing. And an event of prior probability 1 should release no 
information at all when it occurs, so f(1) should be 0. 

Even among functions definable by elementary formulas, there are an infi- 
nite number of functions on (0, 1] satisfying the requirements noted above; for 
instance, 1 − x^q and (1/x)^q − 1 satisfy those requirements, for any q > 0. One 
advantage that log(1/x) has over these functions is that it converts products to 
sums, and a lot of products occur in the calculation of probabilities. As we shall 
see in section 2.1.3, this facile, shallow observation in favor of log(1/x) as the 
choice of function to be used to quantify information is remarkably close to the 
reason why log(1/x) is the only possible choice for that purpose. 
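
The product-to-sum remark can be checked numerically. The sketch below compares log(1/x) with the competitor 1 − x^q (here with q = 1); the helper names `f_log` and `f_power` and the sample probabilities are illustrative choices, not taken from the text.

```python
import math

def f_log(x):
    """Shannon's choice, f(x) = log(1/x), here with the natural logarithm."""
    return math.log(1.0 / x)

def f_power(x, q=1.0):
    """A competing decreasing, non-negative function with f(1) = 0."""
    return 1.0 - x ** q

p, r = 0.5, 0.25  # illustrative probabilities

# How far is f(p * r) from f(p) + f(r)?
log_gap = f_log(p * r) - (f_log(p) + f_log(r))
power_gap = f_power(p * r) - (f_power(p) + f_power(r))

print(abs(log_gap) < 1e-12)   # True: products become sums
print(power_gap)              # -0.375: no such additivity
```

Both functions satisfy the three requirements noted above, yet only the logarithm is additive over products.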

2.1.1 Naming the units 


For any a, b > 0, a ≠ 1 ≠ b, and x > 0, $\log_a x = (\log_a b)\log_b x$; that is, the 
functions log x to different bases are just constant multiples of each other. So, 
in Shannon’s use of log in the quantification of information, changing bases is 
like changing units. Choosing a base amounts to choosing a unit of information. 
What requires discussion is the name of the unit that Shannon chose when the 
base is 2: Shannon chose to call that unit a bit. 

Yes, the unit name when log = $\log_2$ is the very same abbreviation of “binary 
digit” widely reported to have been invented by J. W. Tukey, who was at Bell 
Labs with Shannon in the several years before [63] appeared. (In [76] we read 
that the word “bit”, with the meaning of “binary digit”, first appeared in print 
in “A mathematical theory of communication.”) 

Now, we do not normally pay much attention to unit names in other con- 
texts. For example, “square meter” as a unit of area seems rather self-explan- 
atory. But in this case the connection between “bit” as a unit of information, 
an arbitrarily divisible quantifiable substance, like a liquid, and “bit” meaning 
a binary digit, either 0 or 1, is not immediately self-evident to human intuition; 
yet Shannon uses the two meanings interchangeably, as has virtually every other 
information theorist since Shannon (although Solomon Golomb, in [24], is care- 
ful to distinguish between the two). We shall attempt to justify the unit name, 
and, in the process, to throw light on the meaning of the information unit when 
the base of the logarithm is a positive integer greater than 2. 

Think of one square meter of area as the greatest amount of area that can be 
squeezed into a square of side length one meter. (You may object that when one 
has a square of side length 1 meter, one already has a maximum area “squeezed” 
into it. Fine; just humor us on this point.) Reciprocally, the square meter mea- 
sure of the area of a planar region is the side length, in meters, of the smallest 
square into which the region can hypothetically be squeezed, by deformation 
without shrinking or expanding (don’t ask for a rigorous definition here!). 

With this in mind, let us take, in analogy to a planar region, an entire 
probabilistic experiment, initially unanalyzed as to its possible outcomes; and now 
let it be analyzed, the possible outcomes broken into a list $E_1, \dots, E_m$ of 
pairwise mutually exclusive events which exhaust the possibilities: 
$P(\bigcup_{i=1}^{m} E_i) = \sum_{i=1}^{m} P(E_i) = 1$ (recall 1.2.3). If you wish, think of each $E_i$ as a 
single outcome, in a set of outcomes. Assume $P(E_i) > 0$, $i = 1, \dots, m$. 

It may be objected that rather than analogizing a planar region by an entire 
probabilistic experiment, a planar region to which the quantity area is assigned 
should be analogous to the kind of thing to which the quantity information is 
assigned, namely a single event. This is a valid objection. 

In what follows, rather than squeezing the information contained in a single 
event into a “box” of agreed size, we will be squeezing the information 
contained in the ensemble of the events $E_1, \dots, E_m$ into a very special box, the 
set of binary words of a certain length. We will compare the average information 
content of the events $E_1, \dots, E_m$ with that length. This comparison will 
be taken to indicate what the maximum average (over a list like $E_1, \dots, E_m$) 
number of units of information can be represented by the typical (aren’t they 
all?) binary word of that length. 

We admit that this is all rather tortuous, as a justification for terminology. 
Until someone thinks of something better, we seem to be forced to this approach 
by the circumstance that we are trying to squeeze the information content of 
events into binary words, whereas, in the case of area, we deform a region 
to fit into another region of standard shape. If we considered only one event, 
extracted without reference to the probabilistic experiment to which it is asso- 
ciated, we could let it be named with a single bit, 0 or 1, and this does not seem 
to be telling us anything. Considering a non-exhaustive ensemble of events as- 
sociated with the same probabilistic experiment (pairwise mutually exclusive 
so that their information contents are separate) we have a generalization of the 
situation with a single event; we can store a lot of information by encoding with 
relatively short binary words, just because we are ignoring the full universe 
of possibilities. Again, this does not seem to lead to a satisfactory conclusion 
about the relation between information and the length of binary words required 
to store it. What about looking at ensembles of events from possibly different 
probabilistic experiments? Again, unless there is some constraint on the num- 
ber of these events and their probabilities, it does not seem that encoding these 
as binary words of fixed length tells us anything about units of information, any 
more than in the case when the events are associated with the same probabilistic 
experiment. 

We realize that this discussion is not wholly convincing; perhaps some- 
one will develop a more compelling way of justifying our setup in the future. 
For now, let us return to $E_1, \dots, E_m$, pairwise mutually exclusive events with 
$\sum_{i=1}^{m} P(E_i) = 1$. If we agree that I(E) = −log P(E) for any event E, with log 
to some base > 1, then the average information content of an event in the list 
$E_1, \dots, E_m$ is 


$$H(E_1, \dots, E_m) = \sum_{i=1}^{m} P(E_i)\, I(E_i) = -\sum_{i=1}^{m} P(E_i) \log P(E_i).$$


(Recall Section 1.8.) 
As is conventional, let ln = $\log_e$, the natural logarithm. 


2.1.1 Lemma For x > 0, ln x ≤ x − 1, with equality when and only when 
x = 1. 


Indication of proof Apply elementary calculus to f(x) = x − 1 − ln x to see 
that f(x) ≥ 0 on (0, ∞), with equality only when x = 1. 


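2.1.2 Theorem If $p_1, \dots, p_m$ are positive numbers summing to 1, then 
$-\sum_{i=1}^{m} p_i \log p_i \le \log m$, with equality if and only if $p_i = 1/m$, $i = 1, \dots, m$. 

Proof: Let c = log e > 0. Since $\sum_{i=1}^{m} p_i = 1$, 

$$\Bigl(-\sum_{i=1}^{m} p_i \log p_i\Bigr) - \log m = \sum_{i=1}^{m} p_i \bigl(\log(1/p_i) - \log m\bigr)$$

$$= \sum_{i=1}^{m} p_i \log\bigl(1/(m p_i)\bigr)$$

$$= c \sum_{i=1}^{m} p_i \ln\bigl(1/(m p_i)\bigr) \le c \sum_{i=1}^{m} p_i \Bigl(\frac{1}{m p_i} - 1\Bigr)$$

$$= c \Bigl(\sum_{i=1}^{m} \frac{1}{m} - \sum_{i=1}^{m} p_i\Bigr) = c(1 - 1) = 0$$

by Lemma 2.1.1, with equality if and only if $1/(m p_i) = 1$ for each $i = 1, \dots, m$. □ 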
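
A small numerical check of Theorem 2.1.2 may help fix ideas. This is only a sketch: the function name `entropy`, the base-2 default, and the two sample distributions are assumptions introduced here, and a numerical check of two cases is of course no substitute for the proof.

```python
import math

def entropy(probs, base=2.0):
    """Average information sum p_i log(1/p_i) for positive p_i summing to 1."""
    if any(p <= 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("expected positive numbers summing to 1")
    return sum(p * math.log(1.0 / p) for p in probs) / math.log(base)

m = 4
uniform = [1.0 / m] * m           # the equality case p_i = 1/m
skewed = [0.7, 0.1, 0.1, 0.1]     # an illustrative non-uniform case

bound = math.log2(m)
h_uniform = entropy(uniform)
h_skewed = entropy(skewed)

print(abs(h_uniform - bound) < 1e-12)   # True: equality at p_i = 1/m
print(h_skewed < bound)                 # True: strict inequality otherwise
```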


Now back to considering $E_1, \dots, E_m$. Let k be an integer such that $m \le 2^k$ 
and let us put the $E_i$ in one-to-one correspondence with m of the binary words 
of length k. That is, the $E_i$ have been encoded, or named, by members of $\{0, 1\}^k$, 
and thereby we consider the ensemble $E_1, \dots, E_m$ to be stored in $\{0, 1\}^k$. 

By Theorem 2.1.2, the average information content among the $E_i$ satisfies 
$H(E_1, \dots, E_m) = -\sum_{i=1}^{m} P(E_i) \log P(E_i) \le \log m \le \log 2^k = k$, if log = $\log_2$; 
and equality can be achieved if $m = 2^k$ and $P(E_i) = 1/m$, $i = 1, \dots, m$. That is, 
the greatest average number of information units per event contained in a 
“system of events”, as we shall call them, which can be stored as k-bit binary words, 
is k, if the unit corresponds to log = $\log_2$. And that, ladies and gentlemen, is 
why we call the unit of information a bit when log = $\log_2$. 

In case log = $\log_n$, for an integer n > 2, we would like to call the unit of 
information a nit, but the term probably won’t catch on. Whatever it is called, 
the discussion preceding can be adapted to justify the equivalence of the 
information unit, when log = $\log_n$, and a single letter of an n-element alphabet. 
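
The adaptation to an n-letter alphabet can be sketched as follows. The helper names `entropy` and `shortest_word_length` and the sample values n = 3, k = 2 are hypothetical; the sketch merely replays the argument above with $\log_n$ in place of $\log_2$.

```python
import math

def entropy(probs, base):
    """Average information content, in units determined by the base."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0) / math.log(base)

def shortest_word_length(m, n):
    """Smallest k with m <= n**k, found with exact integer arithmetic."""
    k = 0
    while n ** k < m:
        k += 1
    return k

n, k = 3, 2                # words of length 2 over a 3-letter alphabet
m = n ** k                 # 9 equally likely events fill every word
h = entropy([1.0 / m] * m, base=n)

print(shortest_word_length(m, n))   # 2
print(abs(h - k) < 1e-9)            # True: k base-n units per word at best
```

With fewer or unequally likely events the average falls below k; for example, three events with probabilities 1/2, 1/4, 1/4 need binary words of length 2 but carry only 1.5 bits on average.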




2.1.2 Information connecting two events 


Let log = $\log_b$ for some b > 1, and suppose that E and F are events in the same 
probability space (i.e., associated with the same probabilistic experiment). 
If P(F) > 0, the conditional information contained in E, conditional upon 
F, denoted I(E | F), is 

$$I(E \mid F) = -\log P(E \mid F) = -\log \frac{P(E \cap F)}{P(F)}.$$

If P(E ∩ F) = 0 we declare I(E | F) = ∞. 
The mutual information of (or between) E and F, denoted I(E, F), is 

$$I(E, F) = \log \frac{P(E \cap F)}{P(E)P(F)}, \quad \text{if } P(E)P(F) > 0,$$

and I(E, F) = 0 otherwise, i.e., if either P(E) = 0 or P(F) = 0. 

If Shannon’s quantification of information is agreed to, and if account is 
taken of the justification of the formula for conditional probability given in 
Section 1.3, then there should be no perplexity regarding the definition of I(E | F). 
But it is a different story with I(E, F). For one thing, I(E, F) can be positive 
or negative. Indeed, if P(E)P(F) > 0 and P(E ∩ F) = 0, we have no choice 
but to set I(E, F) = −∞; also, I(E, F) can take finite negative values, as well 
as positive ones. 
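
The sign behavior just described is easy to see in a small example. The sketch below assumes one roll of a fair die with E = “an even number came up”; the function names and the choice of events are illustrative assumptions, and base 2 is used throughout.

```python
import math

def conditional_information(p_e_and_f, p_f, base=2.0):
    """I(E | F) = -log(P(E ∩ F) / P(F)); defined only when P(F) > 0."""
    if p_f <= 0:
        raise ValueError("I(E | F) requires P(F) > 0")
    if p_e_and_f == 0:
        return math.inf
    return math.log(p_f / p_e_and_f) / math.log(base)

def mutual_information(p_e, p_f, p_e_and_f, base=2.0):
    """I(E, F) = log(P(E ∩ F) / (P(E) P(F))), with the stated conventions."""
    if p_e == 0 or p_f == 0:
        return 0.0
    if p_e_and_f == 0:
        return -math.inf
    return math.log(p_e_and_f / (p_e * p_f)) / math.log(base)

p_even = 1 / 2
# F = "the roll is 2": contained in E, so the mutual information is positive.
positive = mutual_information(p_even, 1 / 6, 1 / 6)
# F = "the roll is at most 3": only 2 is even, so it is negative.
negative = mutual_information(p_even, 1 / 2, 1 / 6)
# F = "the roll is 1": incompatible with E, so it is minus infinity.
minus_infinity = mutual_information(p_even, 1 / 6, 0.0)
# F = "the roll is at most 2": independent of E, so it is zero.
zero = mutual_information(p_even, 1 / 3, 1 / 6)
```

Here `positive` is 1 bit, `negative` is $\log_2(2/3)$, and `zero` vanishes up to rounding.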

We do not know of a neat justification for the term “mutual information”, 
applied to I(E, F), but quite a strong case for the terminology can be built on 
circumstantial evidence, so to speak. The mutual information function and the 
important index based on it, the mutual information between two systems of 
events, to be introduced in Section 2.2, behave as one would hope that indices 
so named would behave. As a first instance of this behavior, consider the fol- 
lowing, the verification of which is left as an exercise. 


2.1.3 Proposition I(E, F) = 0 if and only if E and F are independent events. 




2.1.3 The inevitability of Shannon’s quantification of information 


Shannon himself provided a demonstration (in [63]) that information must be 
quantified as he proposed, given that it is to be a quantity attached to random 
phenomena, and supposing certain other fundamental premises about its behav- 
ior. His demonstration was mathematically intriguing, and certainly contributed 
to the shocked awe with which “A mathematical theory of communication” was 
received. However, after the initial astonishment at Shannon’s virtuosity wears 
off, one notices a certain infelicity in this demonstration, arising from the ab- 
struseness of those certain other fundamental premises referred to above. These 
premises are not about information directly, but about something called entropy, 
defined in Section 2.3 as the average information content of events in a system of 
events. [Yes, we have already seen this average in Section 2.1.1.] Defined thus, 
entropy can also be regarded as a function on the space of all finite probability 
vectors, and it is as such that certain premises—we could call them axioms— 
about entropy were posed by Shannon. He then showed that if entropy, defined 
with respect to information, is to satisfy these axioms, then information must be 
defined as it is. 

The problem with the demonstration has to do with our assent to the ax- 
ioms. This assent is supposed to arise from a prior acquaintance with the word 
“entropy,” connoting disorder or unpredictability, in thermodynamics or the ki- 
netic theory of gases. Even supposing an acquaintance with entropy in those 
contexts, there are a couple of intellectual leaps required to assent to Shannon’s 
axioms for entropy: why should this newly defined, information-theoretic en- 
tropy carry the connotation of the older entropy, and how does this connotation 
translate into the specific axioms set by Shannon? 

The demonstration of Feinstein [17] is of the same sort as Shannon’s, with 
a somewhat more agreeable set of axioms, lessening the vertigo associated with 
the second of the intellectual leaps mentioned above. The first leap remains. 
Why should we assent to requirements on something called entropy just because 
we are calling it entropy, a word that occurs in other contexts? 

Here are some requirements directly on the function f appearing in I(E) = 
f(P(E)) enunciated by Aczél and Daróczy [1]: 


(i) f(x) ≥ 0 for all x ∈ (0, 1]; 
(ii) f(x) > 0 for all x ∈ (0, 1); and 
(iii) f(pq) = f(p) + f(q) for all p, q ∈ (0, 1]. 


Requirements (i) and (ii) have been discussed in section 2.1.1. Notice that 
there is no requirement that f be decreasing here. 

Obviously requirement (iii) above is a strong requirement, and deserves 
considerable comment. Suppose that p,q € (0, 1]. Suppose that we can find 
events E and F, possibly associated with different probabilistic experiments, 
such that P(E) = p and P(F) = q. (We pass over the question of whether 
or not probabilistic experiments providing events of arbitrary prior probability 
can be found.) Now imagine the two-stage experiment consisting of performing 
copies of the experiments associated with E and F independently. Let G be the 
event “E occurred in the one experiment and F occurred in the other’. Then, 
as we know from section 1.3, P(G) = pq, so I(G) = f(pq). 

On the other hand, the independence of the performance of the probabilistic 
experiments means that the information given off by the occurrence of E in one, 
and F in the other, ought to be the sum of the information quantities disclosed 
by the occurrence of each separately. This is like saying that the area of a region 
made up of two non-overlapping regions ought to be the sum of the areas of the 
constituent regions. Thus we should have 


f(pq) = I(G) = I(E) + I(F) = f(p) + f(q). 


We leave it to the reader to scrutinize the heart of the matter, the contention 
that because the two probabilistic experiments are performed with indifference, 
or obliviousness, to each other, the information that an observer will obtain 
from the occurrences of E and F, in the different experiments, ought to be 
I(E) + I(F). We make the obvious remark that if you receive something (say, 
money) from one source, and then some more money from a totally different 
source, then the total amount of money received will be the sum of the two 
amounts received. 
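
The two-stage construction can be made concrete. In the sketch below, a fair coin and a fair die stand in for the two experiments (an illustrative assumption), the joint probabilities are formed as products because the experiments are performed independently, and information is measured with base 2.

```python
import math
from itertools import product

coin = {"H": 0.5, "T": 0.5}
die = {face: 1 / 6 for face in range(1, 7)}

# Independent performance: each pair of outcomes gets the product probability.
joint = {(c, d): pc * pd
         for (c, pc), (d, pd) in product(coin.items(), die.items())}

def info(p):
    """f(p) = log2(1/p)."""
    return math.log2(1.0 / p)

E = {"H"}        # an event in the coin experiment, P(E) = p
F = {1, 2}       # an event in the die experiment,  P(F) = q
p = sum(coin[c] for c in E)
q = sum(die[d] for d in F)

# G = "E occurred in the one experiment and F occurred in the other"
p_G = sum(pr for (c, d), pr in joint.items() if c in E and d in F)

print(abs(p_G - p * q) < 1e-12)                     # True
print(abs(info(p_G) - (info(p) + info(q))) < 1e-12) # True
```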

We achieve the purpose of this subsection by proving a slightly stronger 
version of Aczél and Daróczy’s result, that if f satisfies (i), (ii), and (iii), above, 
then $f(x) = -\log_b x$ for some b > 1 for all x ∈ (0, 1]. 


2.1.4 Theorem Suppose that f is a real-valued function on (0, 1] satisfying 
(a) f(x) ≥ 0 for all x ∈ (0, 1]; 
(b) f(a) > 0 for some a ∈ (0, 1]; and 
(c) f(pq) = f(p) + f(q) for all p, q ∈ (0, 1]. 
Then for some b > 1, $f(x) = -\log_b x$ for all x ∈ (0, 1]. 


Proof: First we show that f is monotone non-increasing on (0, 1]. Suppose 
that 0 < x < y ≤ 1. Then $f(x) = f(y \cdot (x/y)) = f(y) + f(x/y) \ge f(y)$, by (a) and 
(c). 

Now we use a standard argument using (c) alone to show that $f(x^r) = r f(x)$ 
for any x ∈ (0, 1] and any positive rational r. First, using (c) repeatedly, 
or, if you prefer, proceeding by induction on m, it is straightforward to see that 
for each x ∈ (0, 1] and each positive integer m, $f(x^m) = m f(x)$. Now suppose 
that x ∈ (0, 1] and that m and n are positive integers. Then $x^m, x^{m/n} \in (0, 1]$, 
and 

$$m f(x) = f(x^m) = f\bigl((x^{m/n})^n\bigr) = n f(x^{m/n}),$$

so $f(x^{m/n}) = \frac{m}{n} f(x)$. 

By (c), f(1) = f(1) + f(1), so f(1) = 0. Therefore the a mentioned in 
(b) is not 1. As b ranges over (1, ∞), $-\log_b a = -\frac{\ln a}{\ln b}$ ranges over (0, ∞). 
Therefore, for some b > 1, $-\log_b a = f(a)$. By the result of the paragraph 
preceding and the properties of $\log_b$, the functions f and $-\log_b$ agree at each 
point $a^r$, r a positive rational. 

The set of such points is dense in (0, 1]. An easy way to see this is to note 
that $\ln a^r = r \ln a$, so $\{\ln a^r;\ r \text{ is a positive rational}\}$ is dense in (−∞, 0), by the 
well-known density of the rationals in the real numbers; and the inverse of ln, 
the exponential function, being continuous and increasing, will map a dense set 
in (−∞, 0) onto a dense set in (0, 1]. 

We have that f and $-\log_b$ are both non-increasing, they agree on a dense 
subset of (0, 1], and $-\log_b$ is continuous. We conclude that $f = -\log_b$ on 
(0, 1]. The argument forcing this conclusion is left as an exercise. □ 
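
The step of the proof that selects the base b can be illustrated numerically. In this sketch the function name `base_from_one_value` and the sample data f(1/4) = 2 are hypothetical; the code solves $-\log_b a = f(a)$ for b and then checks that $-\log_b$ takes the values forced by $f(a^r) = r f(a)$ at a few rational powers.

```python
import math

def base_from_one_value(a, f_a):
    """Solve -log_b(a) = f_a for b, given 0 < a < 1 and f_a > 0."""
    if not (0.0 < a < 1.0) or f_a <= 0.0:
        raise ValueError("need 0 < a < 1 and f(a) > 0")
    # -ln(a) / ln(b) = f_a  implies  ln(b) = -ln(a) / f_a
    return math.exp(-math.log(a) / f_a)

a, f_a = 0.25, 2.0                 # hypothetical data: f(1/4) = 2
b = base_from_one_value(a, f_a)    # close to 2

# At a**r the function must take the value r * f(a); compare with -log_b.
for r in (0.5, 3.0, 5.0 / 3.0):
    forced_value = r * f_a
    log_value = -math.log(a ** r) / math.log(b)
    assert abs(forced_value - log_value) < 1e-9
```

The single value f(a) thus determines the base, and hence the unit of information.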


Exercises 2.1 


1. Regarding the experiment described in Exercise 1.3.5, let $E_A$ = “urn A 
was chosen” and $F_g$ = “a green ball was drawn”. Write explicitly: (a) 
$I(E_A, F_g)$; (b) $I(E_A \mid F_g)$; (c) $I(F_g \mid E_A)$. 

2. Suppose that E is an event in some probability space, and P(E) > 0. Show 
that I(E, E) = I(E), and that I(E | E) = 0. 


3. Suppose that E and F are events in some probability space. Show that 
I(E, F) = 0 if and only if E and F are independent. Show that I(E, F) = 
I(E) − I(E | F), if P(E)P(F) > 0. Show that I(E, F) ≤ min(I(E), I(F)). 
Show that I(E, F) = I(F) if and only if E is essentially contained in F, 
meaning, P(E \ F) = 0. 


4. Fill in the proof of Lemma 2.1.1. 

5. Suppose that $p_1, \dots, p_n, q_1, \dots, q_n$ are positive numbers and $\sum_i p_i = 1 = \sum_i q_i$. 
Show that $\sum_{i=1}^{n} p_i \log(1/q_i) \ge \sum_{i=1}^{n} p_i \log(1/p_i)$ with equality if 
and only if $p_i = q_i$, $i = 1, \dots, n$. [Hint: look at the proof of Theorem 2.1.2.] 

6. Show that if f and g are monotone non-increasing real-valued functions on 
a real interval I which agree on a dense subset of I, and g is continuous, 
then f = g on I. Give an example to show that the conclusion is not valid 
if the assumption that g is continuous is omitted. 


2.2 Systems of events and mutual information 


Suppose that (S, P) is a finite probability space. A system of events in (S, P) is 
a finite indexed collection $\mathcal{E} = [E_i;\ i \in I]$ of pairwise mutually exclusive events 
satisfying $1 = P(\bigcup_{i \in I} E_i)$. 


Remarks 


2.2.1 When $1 = P(\bigcup_{i \in I} E_i)$, it is common to say that the $E_i$ exhaust S. 


2.2.2 Note that if the $E_i$ are pairwise mutually exclusive, then $P(\bigcup_{i \in I} E_i) = 
\sum_{i \in I} P(E_i)$, by 1.2.3. 


2.2.3 Any partition of S is a system of events in (S, P) (see exercise 1.1.2), 
and partitioning is the most obvious way of obtaining systems of events. For 
instance, in the case of n Bernoulli trials, with $S = \{S, F\}^n$, if we take $E_k$ = 
“exactly k successes,” then $E_0, \dots, E_n$ partition S. 
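
The Bernoulli example can be spelled out in a few lines. The sketch assumes independent trials with a common success probability; the values n = 3 and 0.3 and all variable names are illustrative choices.

```python
from itertools import product
from math import comb

n, p_success = 3, 0.3                 # illustrative parameters
S = list(product("SF", repeat=n))     # all sequences of successes and failures

def prob(outcome):
    """Probability of one sequence under independent trials."""
    k = outcome.count("S")
    return p_success ** k * (1 - p_success) ** (n - k)

# E[k] = "exactly k successes", for k = 0, ..., n
E = {k: {s for s in S if s.count("S") == k} for k in range(n + 1)}
P_E = {k: sum(prob(s) for s in E[k]) for k in E}

# The E[k] are pairwise disjoint, cover S, and their probabilities sum to 1.
assert all(E[a].isdisjoint(E[b]) for a in E for b in E if a != b)
assert set().union(*E.values()) == set(S)
assert abs(sum(P_E.values()) - 1.0) < 1e-12
```

Each P(E_k) computed this way agrees with the binomial probability.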

It is possible to have a system of events in (S, P) which does not partition 
S only when S contains outcomes with zero probability. Just as it is convenient 
to allow outcomes of zero probability to be elements of sets of outcomes, it 
is convenient to allow events of zero probability in systems of events. One 
aspect of this convenience is that when we derive new systems from old, as we 
shall, we do not have to stop and weed out the events of probability zero in the 
resultant system. 

In deriving or describing systems of events we may have repeated events 
in the system, $E_i = E_j$ for some indices $i \ne j$. In this case, $P(E_i) = 0$. For 
better or for worse, the formality of defining a system of events as an indexed 
collection, rather than as a set or list of events, permits such repetition. 


2.2.4 If $\mathcal{E} = [E_i;\ i \in I]$ is a system of events in (S, P), we can take $\mathcal{E}$ as a new 
set of outcomes. The technical niceties are satisfied since $1 = P(\bigcup_{i \in I} E_i) = 
\sum_{i \in I} P(E_i)$; in viewing $\mathcal{E}$ as a set we regard the $E_i$ as distinct, even when they 
are not. 

Taking $\mathcal{E}$ as a new set of outcomes involves a certain change of view. The 
old outcomes are merged into “larger” conglomerate outcomes, the events $E_i$. 
The changes in point of view achievable in this way are constrained by the 
choice of the original set of outcomes, S. Notice that S itself can be thought of 
as a system of events, if we identify each s ∈ S with the event {s}. 

If $\mathcal{F}$ is a system of events in (S, P), and we choose to view $\mathcal{F}$ as a set of 
outcomes, then we can form systems of events in the “new” space $(\mathcal{F}, P)$ just 
as $\mathcal{F}$ is formed from S. Systems of events in $(\mathcal{F}, P)$ are really just systems in 
(S, P) that bear a certain relation to $\mathcal{F}$. 


2.2.5 Definition Suppose $\mathcal{E} = [E_i;\ i \in I]$ and $\mathcal{F} = [F_j;\ j \in J]$ are systems of 
events in a finite probability space (S, P); we say that $\mathcal{E}$ is an amalgamation of 
$\mathcal{F}$ in case for each $j \in J$ there is some $i \in I$ such that $P(E_i \cap F_j) = P(F_j)$. 


2.2.6 Definition If (S, P) is a finite probability space and E, F ⊆ S, we will 
say that F ⊆ E essentially if P(F \ E) = 0, and F = E essentially if F ⊆ E 
essentially and E ⊆ F essentially. 


To make sense of Definition 2.2.5, notice that $P(E_i \cap F_j) = P(F_j)$ is 
equivalent to $P(F_j \setminus (E_i \cap F_j)) = 0$, which means that $F_j$ is contained in $E_i$, except, 
possibly, for outcomes of zero probability. So, the condition in the definition 
says that each $F_j$ is essentially contained in some $E_i$. By Corollary 2.2.8, 
below, $P(F_j) = \sum_{i \in I} P(E_i \cap F_j)$ for each $j \in J$, so if $0 < P(F_j) = P(E_{i_0} \cap F_j)$ 
for some $i_0 \in I$, then $i_0$ is unique and $P(E_i \cap F_j) = 0$ for all $i \in I$, $i \ne i_0$. This 
says that, when $\mathcal{E}$ is an amalgamation of $\mathcal{F}$, each $F_j$ of positive probability is 
essentially contained in exactly one $E_i$. Since the $F_j$ are mutually exclusive and 
essentially cover S, it also follows that each $E_i$ is essentially (neglecting 
outcomes of zero probability) the union of the $F_j$ it essentially contains; i.e., the 
$E_i$ are obtained by “amalgamating” the $F_j$ somehow. (Recall exercise 1.1.2.) 

Indeed, the most straightforward way to obtain amalgamations of $\mathcal{F}$ is as 
follows: partition J into non-empty subsets $J_1, \dots, J_k$ and set $\mathcal{E} = [E_1, \dots, 
E_k]$, with $E_i = \bigcup_{j \in J_i} F_j$. It is left to you to verify that $\mathcal{E}$ thus obtained is an 
amalgamation of $\mathcal{F}$. We shall prove the insinuations of the preceding paragraph, 
and see that every amalgamation is essentially (neglecting outcomes of 
probability zero) obtained in this way. This formality will also justify the 
interpretation of an amalgamation of $\mathcal{F}$ as a system of events in the new space $(\mathcal{F}, P)$, 
treating $\mathcal{F}$ as a set of outcomes. Readers who already see this interpretation and 
abhor formalities may skip to 2.2.10. 
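
The partition recipe can be tried on a small case. The sketch below assumes one roll of a fair die, takes the system of singletons as the finer system, and amalgamates by parity; the names `is_amalgamation` and `blocks` are hypothetical, and exact fractions are used so that the equality in Definition 2.2.5 can be tested literally. It checks one instance only and does not replace the general verification.

```python
from fractions import Fraction

S = range(1, 7)                           # one roll of a fair die
P = {s: Fraction(1, 6) for s in S}

def prob(event):
    return sum(P[s] for s in event)

# The finer system: F_j = {j}, indexed by J = {1, ..., 6}
F = {j: frozenset({j}) for j in S}

# Partition the index set J into non-empty blocks and take unions.
blocks = {"odd": [1, 3, 5], "even": [2, 4, 6]}
E = {i: frozenset().union(*(F[j] for j in J_i)) for i, J_i in blocks.items()}

def is_amalgamation(coarse, fine):
    """Definition 2.2.5: each F_j satisfies P(E_i ∩ F_j) = P(F_j) for some i."""
    return all(any(prob(Ei & Fj) == prob(Fj) for Ei in coarse.values())
               for Fj in fine.values())

print(is_amalgamation(E, F))   # True
print(is_amalgamation(F, E))   # False: "odd" is not inside any singleton
```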


2.2.7 Lemma If (S, P) is a finite probability space, E, F ⊆ S, and P(F) = 1, 
then P(E) = P(E ∩ F). 


Proof: P(E) = P(E ∩ F) + P(E ∩ (S \ F)) = P(E ∩ F) + 0, since E ∩ (S \ F) ⊆ S \ F 
and P(S \ F) = 1 − P(F) = 1 − 1 = 0. □ 

2.2.8 Corollary If $\mathcal{F} = [F_j;\ j \in J]$ is a system of events in (S, P), and E ⊆ S, 
then $P(E) = \sum_{j \in J} P(E \cap F_j)$. 


Proof: Suppose $j_1, j_2 \in J$ and $j_1 \ne j_2$. We have $(E \cap F_{j_1}) \cap (E \cap F_{j_2}) \subseteq F_{j_1} \cap F_{j_2}$, 
so $0 \le P((E \cap F_{j_1}) \cap (E \cap F_{j_2})) \le P(F_{j_1} \cap F_{j_2}) = 0$ since $F_{j_1}$ and $F_{j_2}$ are 
mutually exclusive. Thus $E \cap F_{j_1}$ and $E \cap F_{j_2}$ are mutually exclusive. It follows 
that $\sum_{j \in J} P(E \cap F_j) = P(\bigcup_{j \in J} (E \cap F_j)) = P(E \cap \bigcup_{j \in J} F_j) = P(E)$ by 
2.2.7, taking $F = \bigcup_{j \in J} F_j$. □ 


2.2.9 Theorem Suppose that $\mathcal{F} = [F_j;\ j \in J]$ is a system of events in a finite 
probability space (S, P), and I is a finite non-empty set. An indexed collection 
$\mathcal{E} = [E_i;\ i \in I]$ of subsets of S is an amalgamation of $\mathcal{F}$ if and only if there is 
a partition $[J_i;\ i \in I]$ of J (into not necessarily non-empty sets) such that, for 
each $i \in I$, $E_i = \bigcup_{j \in J_i} F_j$ essentially. 


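Proof: The proof of the “if” assertion is left to the reader. Note that as part of 
this proof it should be verified that $\mathcal{E}$ is a system of events. 

Suppose that $\mathcal{E}$ is an amalgamation of $\mathcal{F}$. If $i \in I$ and $P(E_i) > 0$, set 
$J_i = \{j \in J;\ P(E_i \cap F_j) > 0\}$. If $j \in J$ and $P(F_j) > 0$, then there is a unique 
$i_0 \in I$ such that $j \in J_{i_0}$, by the argument in the paragraph following Definition 
2.2.6. Note that $P(F_j) = P(F_j \cap E_i)$ if $j \in J_i$, by that argument. 

Thus we have pairwise disjoint sets $J_i \subseteq J$ containing every $j \in J$ such 
that $P(F_j) > 0$. If $P(F_j) = 0$, put j into one of the $J_i$, it doesn’t matter which. 
If $P(E_i) = 0$, set $J_i = \emptyset$. The $J_i$, $i \in I$, now partition J. It remains to be seen 
that $E_i = \bigcup_{j \in J_i} F_j$ essentially for each $i \in I$. This is clear if $P(E_i) = 0$. If 
$P(E_i) > 0$, then, since $P(E_i \cap F_j) > 0$ only for $j \in J_i$, we have 

$$P(E_i) = \sum_{j \in J} P(E_i \cap F_j) \quad \text{[by Corollary 2.2.8]}$$

$$= \sum_{j \in J_i} P(E_i \cap F_j) = P\Bigl(\bigcup_{j \in J_i} (E_i \cap F_j)\Bigr), \text{ so}$$

$$P\Bigl(E_i \setminus \bigcup_{j \in J_i} F_j\Bigr) = P\Bigl(E_i \setminus \bigcup_{j \in J_i} (E_i \cap F_j)\Bigr) = P(E_i) - P\Bigl(\bigcup_{j \in J_i} (E_i \cap F_j)\Bigr) = 0.$$

On the other hand, 

$$P(E_i) = \sum_{j \in J_i} P(E_i \cap F_j) = \sum_{j \in J_i} P(F_j)$$

implies that 

$$P\Bigl(\bigl(\bigcup_{j \in J_i} F_j\bigr) \setminus E_i\Bigr) = P\Bigl(\bigcup_{j \in J_i} \bigl(F_j \setminus (E_i \cap F_j)\bigr)\Bigr) = \sum_{j \in J_i} \bigl(P(F_j) - P(E_i \cap F_j)\bigr) = 0.$$

Thus $E_i = \bigcup_{j \in J_i} F_j$, essentially. □ 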


When $\mathcal{E}$ is an amalgamation of $\mathcal{F}$, we say that $\mathcal{E}$ is coarser than $\mathcal{F}$, and $\mathcal{F}$ 
is finer than $\mathcal{E}$. Given $\mathcal{E}$ and $\mathcal{F}$, neither necessarily coarser than the other, there 
is an obvious way to obtain a coarsest system of events which is finer than each 
of $\mathcal{E}$, $\mathcal{F}$. (See Exercise 2.2.2.) 

2.2.10 Definition Suppose that $\mathcal{E} = [E_i;\ i \in I]$ and $\mathcal{F} = [F_j;\ j \in J]$ are systems 
of events in a finite probability space (S, P). The joint system associated with 
$\mathcal{E}$ and $\mathcal{F}$, denoted $\mathcal{E} \wedge \mathcal{F}$, is 


$$\mathcal{E} \wedge \mathcal{F} = [E_i \cap F_j;\ (i, j) \in I \times J].$$

$\mathcal{E} \wedge \mathcal{F}$ is also called the join of $\mathcal{E}$ and $\mathcal{F}$. 


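2.2.11 Theorem If $\mathcal{E}$ and $\mathcal{F}$ are systems of events in (S, P) then $\mathcal{E} \wedge \mathcal{F}$ is a 
system of events in (S, P). 


Proof: If $(i, j), (i', j') \in I \times J$ and $(i, j) \ne (i', j')$, then either $i \ne i'$ or $j \ne j'$. 
Suppose that $i \ne i'$. Since $(E_i \cap F_j) \cap (E_{i'} \cap F_{j'}) \subseteq E_i \cap E_{i'}$, we have 

$$0 \le P\bigl((E_i \cap F_j) \cap (E_{i'} \cap F_{j'})\bigr) \le P(E_i \cap E_{i'}) = 0,$$

so $E_i \cap F_j$ and $E_{i'} \cap F_{j'}$ are mutually exclusive. The case $i = i'$ but $j \ne j'$ is 
handled symmetrically. 
Next, since mutual exclusivity has already been established, 

$$P\Bigl(\bigcup_{(i,j) \in I \times J} (E_i \cap F_j)\Bigr) = \sum_{(i,j) \in I \times J} P(E_i \cap F_j) = \sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j)$$

$$= \sum_{i \in I} P(E_i) \quad \text{[by Corollary 2.2.8]}$$

$$= 1. \qquad \Box$$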
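
A concrete join may make the definition easier to picture. The sketch below assumes two fair dice, with one system recording the red die and the other the parity of the sum; these systems, and the helper names `join` and `is_system`, are illustrative assumptions rather than anything prescribed by the text.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))      # (red, green) for two fair dice
P = {s: Fraction(1, 36) for s in S}

def prob(event):
    return sum(P[s] for s in event)

# E_i = "i came up on the red die"; F_j = "the sum has parity j"
E = {i: frozenset(s for s in S if s[0] == i) for i in range(1, 7)}
F = {j: frozenset(s for s in S if (s[0] + s[1]) % 2 == j) for j in (0, 1)}

def join(E, F):
    """The joint system [E_i ∩ F_j; (i, j) in I x J]."""
    return {(i, j): E[i] & F[j] for i, j in product(E, F)}

def is_system(events):
    """Pairwise mutually exclusive events whose union has probability 1."""
    evs = list(events.values())
    exclusive = all(prob(evs[a] & evs[b]) == 0
                    for a in range(len(evs)) for b in range(a + 1, len(evs)))
    return exclusive and prob(frozenset().union(*evs)) == 1

EF = join(E, F)
print(len(EF), is_system(EF))   # 12 True
```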


Definition Suppose that $\mathcal{E} = [E_i;\ i \in I]$ and $\mathcal{F} = [F_j;\ j \in J]$ are systems of 
events in some finite probability space (S, P); $\mathcal{E}$ and $\mathcal{F}$ are statistically 
independent if and only if $E_i$ and $F_j$ are independent events, for each $i \in I$ and 
$j \in J$. 


Statistically independent systems of events occur quite commonly in asso- 
ciation with multistage experiments in which two of the stages are “indepen- 
dent” —i.e., outcomes at one of the two stages do not influence the probabilities 
of the outcomes at the other. For instance, think of an experiment consisting of 
flipping a coin, and then drawing a ball from an urn, and two systems, one con- 
sisting of the two events “heads came up”, “tails came up”, with reference to the 
coin flip, and the other consisting of the events associated with the colors of the 
balls that might be drawn. For a more formal discussion of stage, or component, 
systems associated with multistage experiments, see Exercise 2.2.5. 


2.2.12 Definition Suppose that $\mathcal{E} = [E_i;\ i \in I]$ and $\mathcal{F} = [F_j;\ j \in J]$ are systems 
of events in a finite probability space (S, P). The mutual information between 
$\mathcal{E}$ and $\mathcal{F}$ is 


$$I(\mathcal{E}, \mathcal{F}) = \sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j)\, I(E_i, F_j).$$

In the expression for $I(\mathcal{E}, \mathcal{F})$ above, $I(E_i, F_j)$ is the mutual information 
between events $E_i$ and $F_j$ as defined in Section 2.1.2: $I(E_i, F_j) = \log \frac{P(E_i \cap F_j)}{P(E_i) P(F_j)}$ 
if $P(E_i)P(F_j) > 0$, and $I(E_i, F_j) = 0$ otherwise. If we adopt the convention 
that 0 log(anything) = 0, then we are permitted to write 

$$I(\mathcal{E}, \mathcal{F}) = \sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j) \log \frac{P(E_i \cap F_j)}{P(E_i) P(F_j)};$$

this rewriting will turn out to be a great convenience. 

The mutual information between two systems of events will be an 
extremely important parameter in the assessment of the performance of 
communication channels, in Chapters 3 and 4, so it behooves us to seek some 
justification of the term “mutual information.” Given $\mathcal{E}$ and $\mathcal{F}$, we can think of the 
mapping $(i, j) \mapsto I(E_i, F_j)$ as a random variable on the system $\mathcal{E} \wedge \mathcal{F}$, and we 
then see that $I(\mathcal{E}, \mathcal{F})$ is the average value of this random variable, the “mutual 
information between events” random variable. This observation would justify 
the naming of $I(\mathcal{E}, \mathcal{F})$, if only we were quite sure that the 
mutual-information-between-events function is well named. It seems that the naming of both mutual 
informations, between events and between systems of events, will have to be 
justified by the behavior of these quantities. We have seen one such justification 
in Proposition 2.1.3, and there is another in the next Theorem. This theorem, by 
the way, is quite surprising in view of the fact that the terms in the sum defining 
$I(\mathcal{E}, \mathcal{F})$ can be negative. 


2.2.13 Theorem Suppose that $\mathcal{E}$ and $\mathcal{F}$ are systems of events in a finite 
probability space. Then $I(\mathcal{E}, \mathcal{F}) \ge 0$ with equality if and only if $\mathcal{E}$ and $\mathcal{F}$ are 
statistically independent. 


Proof: Let c = log e, so that log x = c ln x for all x > 0. If $i \in I$, $j \in J$, and 
$P(E_i \cap F_j) > 0$, we have, by Lemma 2.1.1, 

$$P(E_i \cap F_j) \log \frac{P(E_i) P(F_j)}{P(E_i \cap F_j)} \le c\, P(E_i \cap F_j) \Bigl[\frac{P(E_i) P(F_j)}{P(E_i \cap F_j)} - 1\Bigr] = c\,[P(E_i) P(F_j) - P(E_i \cap F_j)]$$

with equality if and only if $\frac{P(E_i) P(F_j)}{P(E_i \cap F_j)} = 1$, i.e., if and only if $E_i$ and $F_j$ 
are independent events. If $P(E_i \cap F_j) = 0$ then, using the convention that 
0 log(anything) = 0, we have 

$$0 = P(E_i \cap F_j) \log \frac{P(E_i) P(F_j)}{P(E_i \cap F_j)} \le c\,[P(E_i) P(F_j) - P(E_i \cap F_j)],$$

again, with equality if and only if $P(E_i) P(F_j) = 0 = P(E_i \cap F_j)$. Thus 

$$-I(\mathcal{E}, \mathcal{F}) = \sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j) \log \frac{P(E_i) P(F_j)}{P(E_i \cap F_j)}$$

$$\le c \sum_{i \in I} \sum_{j \in J} \bigl[P(E_i) P(F_j) - P(E_i \cap F_j)\bigr]$$

$$= c \Bigl[\sum_{i \in I} P(E_i) \sum_{j \in J} P(F_j) - \sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j)\Bigr]$$

$$= c\,[1 - 1] = 0 \quad \text{[note Theorem 2.2.11]}$$

with equality if and only if $E_i$ and $F_j$ are independent, for each $i \in I$, $j \in J$. □ 
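
Theorem 2.2.13 can be observed on a two-stage experiment of the coin-then-urn kind. In this sketch the function name `system_mutual_information`, the joint probability tables, and the base-2 default are all assumptions chosen for illustration; the table is keyed by pairs (i, j) and holds $P(E_i \cap F_j)$.

```python
import math

def system_mutual_information(joint, base=2.0):
    """I(E, F) from joint[(i, j)] = P(E_i ∩ F_j), with 0 log(anything) = 0."""
    p_e, p_f = {}, {}
    for (i, j), p in joint.items():
        p_e[i] = p_e.get(i, 0.0) + p   # P(E_i) = sum over j (Corollary 2.2.8)
        p_f[j] = p_f.get(j, 0.0) + p   # P(F_j) = sum over i
    total = 0.0
    for (i, j), p in joint.items():
        if p > 0.0:
            total += p * math.log(p / (p_e[i] * p_f[j]))
    return total / math.log(base)

# The coin flip does not influence the draw: a single urn, 1/4 red, 3/4 green.
independent = {("H", "red"): 0.125, ("H", "green"): 0.375,
               ("T", "red"): 0.125, ("T", "green"): 0.375}
# The coin flip selects the urn: heads favors red, tails favors green.
dependent = {("H", "red"): 0.375, ("H", "green"): 0.125,
             ("T", "red"): 0.125, ("T", "green"): 0.375}

print(abs(system_mutual_information(independent)) < 1e-12)   # True
print(system_mutual_information(dependent) > 0)              # True
```

In the dependent table two of the four terms of the sum are negative, yet the total is positive, as the theorem requires.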


Exercises 2.2 
1. You have two fair dice, one red, one green. You roll them once. We can 
make a probability space referring to this experiment in a number of 
different ways. Let 


$S_1$ = {“i appeared on the red die, j on the green”; i, j ∈ {1, ..., 6}}, 
$S_2$ = {“i appeared on one of the dice, and j on the other”; 
i, j ∈ {1, ..., 6}}, 
$S_3$ = {“the sum of the numbers appearing on the dice was k”; 
k ∈ {2, ..., 12}}, and 
$S_4$ = {“even numbers on both dice”, “even on the red, odd on the green”, 
“even on the green, odd on the red”, “odd numbers on both dice”}. 


Which pairs of these sets of outcomes have the property that neither is an 
amalgamation of the other? 


2. Suppose that $\mathcal{E}$ and $\mathcal{F}$ are systems of events in a finite probability space. 


(a) Prove that each of $\mathcal{E}$, $\mathcal{F}$ is an amalgamation of $\mathcal{E} \wedge \mathcal{F}$. [Thus, $\mathcal{E} \wedge \mathcal{F}$ is 
finer than each of $\mathcal{E}$, $\mathcal{F}$.] 

(b) Suppose that each of $\mathcal{E}$, $\mathcal{F}$ is an amalgamation of a system of events $\mathcal{G}$. 
Show that $\mathcal{E} \wedge \mathcal{F}$ is an amalgamation of $\mathcal{G}$. [So $\mathcal{E} \wedge \mathcal{F}$ is the coarsest 
system of events, among those that are finer than each of $\mathcal{E}$, $\mathcal{F}$.] Here 
is a hint for (b): Suppose that E, F, and G are events in $\mathcal{E}$, $\mathcal{F}$ and $\mathcal{G}$, 
respectively, and P(G ∩ E ∩ F) > 0. Then P(G ∩ E), P(G ∩ F) > 0. 
By the assumption that $\mathcal{E}$, $\mathcal{F}$ are amalgamations of $\mathcal{G}$ and an argument 
in the proof of Theorem 2.2.9, it follows that G is essentially contained 
in E, and in F. So... 


3. Two fair dice, one red, one green, are rolled once. Let $\mathcal{E} = [E_1, \dots, E_6]$, 
where $E_i$ = “i came up on the red die”, and $\mathcal{F} = [F_2, \dots, F_{12}]$, where 
$F_j$ = “the sum of the numbers that came up on the two dice was j”. Write 
$I(\mathcal{E}, \mathcal{F})$ explicitly, in a form that permits calculation once a base for “log” 
is specified. 

4. Regarding the experiment described in Exercise 1.3.5, let $E_U$ = “urn U 
was chosen”, U ∈ {A, B, C}, $F_r$ = “a red ball was drawn”, $F_g$ = “a green 
ball was drawn”, $\mathcal{E} = [E_A, E_B, E_C]$, and $\mathcal{F} = [F_r, F_g]$. Write $I(\mathcal{E}, \mathcal{F})$ 
explicitly, in a form that permits calculation, given a base for “log”. 

5. Suppose we have a $k$-stage experiment, with possible outcomes $x_1^{(i)},\dots,x_{n_i}^{(i)}$
at the $i$th stage, $i = 1,\dots,k$. Let us take, as we often do,
$S = \{(x_{j_1}^{(1)},\dots,x_{j_k}^{(k)});\ 1 \le j_i \le n_i,\ i = 1,\dots,k\}$, the set of all sequences
of possible outcomes at the different stages.

For $1 \le i \le k$, $1 \le j \le n_i$, let $E_j^{(i)}$ = “$x_j^{(i)}$ occurred at the $i$th stage”, and
$\mathcal{E}^{(i)} = [E_j^{(i)};\ j = 1,\dots,n_i]$. We will call $\mathcal{E}^{(i)}$ the $i$th stage, or $i$th
component, system of events in $(S, P)$. Two stages will be called independent if
and only if their corresponding component systems are statistically
independent.


In the case $k = 2$, let us simplify notation by letting the possible outcomes at
the first stage be $x_1,\dots,x_n$ and at the second stage $y_1,\dots,y_m$. Let $E_i$ = “$x_i$
occurred at the first stage”, $i = 1,\dots,n$, $F_j$ = “$y_j$ occurred at the second
stage”, $j = 1,\dots,m$, be the events comprising the component systems $\mathcal{E}$
and $\mathcal{F}$, respectively.

(a) Let $p_i = P(E_i)$ and $q_{ij} = P(F_j \mid E_i)$, $i = 1,\dots,n$, $j = 1,\dots,m$, as in
Section 1.3. Assume that each $p_i$ is positive. Show that $\mathcal{E}$ and $\mathcal{F}$ are
statistically independent if and only if the $q_{ij}$ depend only on $j$, not $i$.
(That is, for any $i_1, i_2 \in \{1,\dots,n\}$ and $j \in \{1,\dots,m\}$, $q_{i_1 j} = q_{i_2 j}$.)

(b) Verify that if $S$ is regarded as a system of events in the space $(S, P)$
(i.e., identify each pair $(x_i, y_j)$ with the event $\{(x_i, y_j)\}$), then
$S = \mathcal{E} \wedge \mathcal{F}$.

*(c) Suppose that three component systems of a multistage experiment, say
$\mathcal{E}^{(1)}$, $\mathcal{E}^{(2)}$, and $\mathcal{E}^{(3)}$, are pairwise statistically independent. Does it
follow that they are jointly statistically independent? This would mean
that for all $1 \le i \le n_1$, $1 \le j \le n_2$, and $1 \le k \le n_3$,
$P(E_i^{(1)} \cap E_j^{(2)} \cap E_k^{(3)}) = P(E_i^{(1)}) P(E_j^{(2)}) P(E_k^{(3)})$. Take note of Exercise 1.4.7.

6. (a) Suppose that $E_1,\dots,E_k, F_1,\dots,F_r$ are events in a finite probability
space $(S, P)$ satisfying

(i) $E_1,\dots,E_k$ are pairwise mutually exclusive;
(ii) $F_1,\dots,F_r$ are pairwise mutually exclusive; and
(iii) for each $i \in \{1,\dots,k\}$, $j \in \{1,\dots,r\}$, $E_i$ and $F_j$ are independent.

Show that $\bigcup_{i=1}^{k} E_i$ and $\bigcup_{j=1}^{r} F_j$ are independent.

(b) Suppose that $\mathcal{E}$, $\tilde{\mathcal{E}}$, $\mathcal{F}$, and $\tilde{\mathcal{F}}$ are systems of events in some finite
probability space, and $\tilde{\mathcal{E}}$ and $\tilde{\mathcal{F}}$ are amalgamations of $\mathcal{E}$ and $\mathcal{F}$, respectively.
Show that if $\mathcal{E}$ and $\mathcal{F}$ are statistically independent, then so are $\tilde{\mathcal{E}}$ and
$\tilde{\mathcal{F}}$. [You did the hard work in part (a).]

(c) It is asserted at the end of Section 1.4 that “when two stages [of a
multistage experiment] are independent, any two events whose descriptions
involve only those stages, respectively, will be independent.” Explain
how this assertion is a special case of the result of 6(a), above.


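7. Succinctly characterize the systems of events $\mathcal{E}$ such that $I(\mathcal{E}, \mathcal{E}) = 0$.


8. Suppose that $\mathcal{E}$ and $\mathcal{F}$ are systems of events in some finite probability space,
and that $\tilde{\mathcal{E}}$ is an amalgamation of $\mathcal{E}$. Show that $I(\tilde{\mathcal{E}}, \mathcal{F}) \le I(\mathcal{E}, \mathcal{F})$. [You
may have to use Lemma 2.1.1.]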


2.3 Entropy 


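Suppose that $\mathcal{E} = [E_i;\ i \in I]$ and $\mathcal{F} = [F_j;\ j \in J]$ are systems of events in some
finite probability space $(S, P)$. The entropy of $\mathcal{E}$, denoted $H(\mathcal{E})$, is

\[ H(\mathcal{E}) = -\sum_{i \in I} P(E_i) \log P(E_i). \]

The joint entropy of the systems $\mathcal{E}$ and $\mathcal{F}$ is the entropy of the joint system,

\[ H(\mathcal{E} \wedge \mathcal{F}) = -\sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j) \log P(E_i \cap F_j). \]

The conditional entropy of $\mathcal{E}$, conditional upon $\mathcal{F}$, is

\[ H(\mathcal{E} \mid \mathcal{F}) = \sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j)\, I(E_i \mid F_j)
   = -\sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j) \log \frac{P(E_i \cap F_j)}{P(F_j)}. \]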


Remarks 


2.3.1 In the definitions above, we continue to observe the convention that
$0 \log(\text{anything}) = 0$. As in the preceding section, the base of the logarithm
is unspecified; any base greater than 1 may be used.
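
As a computational aside, the three definitions above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the layout `joint[i][j]` $= P(E_i \cap F_j)$, the default base 2, and the sample table are all hypothetical choices, not part of the text.

```python
import math

def plog(p, base=2.0):
    """Return p * log(p), observing the convention 0 * log(anything) = 0."""
    return 0.0 if p == 0 else p * math.log(p, base)

def entropy(probs, base=2.0):
    """H(E) = -sum_i P(E_i) log P(E_i)."""
    return -sum(plog(p, base) for p in probs)

def joint_entropy(joint, base=2.0):
    """H(E ^ F), where joint[i][j] = P(E_i intersect F_j)."""
    return -sum(plog(p, base) for row in joint for p in row)

def conditional_entropy(joint, base=2.0):
    """H(E | F) = -sum_ij P(E_i F_j) log( P(E_i F_j) / P(F_j) )."""
    # Column sums give the probabilities P(F_j).
    p_f = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    total = 0.0
    for row in joint:
        for j, p in enumerate(row):
            if p > 0:  # terms with P(E_i F_j) = 0 contribute nothing
                total -= p * math.log(p / p_f[j], base)
    return total

# A made-up joint table for two systems with two events each.
joint = [[0.25, 0.25],
         [0.50, 0.00]]

print(entropy([0.5, 0.5]))         # H(E), from the row sums
print(joint_entropy(joint))        # H(E ^ F)
print(conditional_entropy(joint))  # H(E | F)
```

Changing `base` only rescales every value by a constant factor, which is why the text can leave the base unspecified.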


2.3.2 Taking $\mathcal{E}$ as a new set of outcomes for the probability space, we see that
$H(\mathcal{E})$ is the average value of the self-information of the events (now outcomes)
in $\mathcal{E}$. Similarly, the joint and conditional entropies are average values of certain
kinds of information.


2.3.3 The word entropy connotes disorder, or uncertainty, and the number $H(\mathcal{E})$
is to be taken as a measure of the disorder or uncertainty inherent in the system
$\mathcal{E}$. What sort of disorder or uncertainty is associable with a system $\mathcal{E}$, and why
is $H(\mathcal{E})$, as defined here, a good measure of it?

A system of events represents a way of looking at an experiment. The pesky
little individual outcomes are grouped into events according to some organizing 
principle. We have the vague intuition that the more finely we divide the set 
of outcomes into events, the greater the “complexity” of our point of view, and 
the less order and simplicity we have brought to the analysis of the experiment. 
Thus, there is an intuitive feeling that some systems of events are more complex, 
less simple, than others, and it is not too great a stretch to make complexity a 
synonym, in this context, of disorder or uncertainty. 

As to why $H$ is a suitable measure of this felt complexity: as with mutual
information, the justification resides in the behavior of the quantity. We refer
to the theorem below, and to the result of Exercise 4 at the end of this section.
The theorem says that $H(\mathcal{E})$ is a minimum (zero) when and only when $\mathcal{E}$
consists of one big event that is certain to occur, together, possibly, with massless
events (events with probability zero). Surely this is the situation of maximum
simplicity, minimum disorder. (A colleague has facetiously suggested that such
systems of events be called “Mussolini systems”. The reference is to the level
of order in the system; in correlation to this terminology, an event of probability
one may be called a Mussolini.) The theorem also says that, for a fixed value
of $|\mathcal{E}| = |I|$, the greatest value $H(\mathcal{E})$ can take is achieved when and only when
the events in $\mathcal{E}$ are equally likely. This taxes the intuition a bit, but it does seem
that having a particular number of equiprobable events is a more “uncertain”
or “complex” situation than having the same number of events, but with some
events more likely than others.

The result of Exercise 2.3.4 is that if $\tilde{\mathcal{E}}$ is obtained from $\mathcal{E}$ by amalgamation,
then $H(\tilde{\mathcal{E}}) \le H(\mathcal{E})$. To put this the other way around, if you obtain a new system
from an old system by dividing the old events into smaller events, the entropy
goes up, as it should.

Shannon ([63] and [65]) introduced an axiom system for entropy, a series
of statements that the symbol H ought to satisfy to be worthy of the name 
entropy, and showed that the definition of entropy given here is the only one 
compatible with these requirements. As previously mentioned, Feinstein [17] 
did something similar, with (perhaps) a more congenial axiom system. For an 
excellent explanation of these axioms and further references on the matter, see 
the book by Dominic Welsh [81] or that of D. S. Jones [37]. We shall not pursue 
further the question of the validity of the definition of H, nor its uniqueness. 


2.3.4 Theorem Suppose that $\mathcal{E} = [E_i;\ i \in I]$ is a system of events in a finite
probability space. Then $0 \le H(\mathcal{E}) \le \log|I|$. Equality at the lower extreme
occurs if and only if all but one of the events in $\mathcal{E}$ have probability zero. [That
one event would then be forced to have probability 1, since
$\sum_{i \in I} P(E_i) = P(\bigcup_{i \in I} E_i) = 1$.] Equality occurs at the upper extreme if and only
if the events in $\mathcal{E}$ are equally likely. [In this case, each event in $\mathcal{E}$ would have
probability $1/|I|$.]


Proof: It is straightforward to see that the given conditions for equality at the 
two extremes are sufficient. 
Since $0 \le P(E_i) \le 1$ for each $i \in I$, $-P(E_i) \log P(E_i) \ge 0$ with equality if
and only if either $P(E_i) = 0$ (by convention) or $P(E_i) = 1$. Thus $H(\mathcal{E}) \ge 0$, and
equality forces $P(E_i) = 0$ or $1$ for each $i$. Since the $E_i$ are pairwise mutually
exclusive, and $\sum_i P(E_i) = 1$, $H(\mathcal{E}) = 0$ implies that exactly one of the $E_i$ has
probability 1 and the rest have probability zero.

Let $c = \log e$. We have

\begin{align*}
H(\mathcal{E}) - \log|I| &= \sum_{i \in I} P(E_i) \log \frac{1}{P(E_i)} - \sum_{i \in I} P(E_i) \log|I| \\
 &= c \sum_{i \in I} P(E_i) \ln \bigl(P(E_i)|I|\bigr)^{-1} \\
 &\le c \sum_{i \in I} P(E_i) \Bigl[\bigl(P(E_i)|I|\bigr)^{-1} - 1\Bigr] \quad \text{(by Lemma 2.1.1)} \\
 &= c \Bigl[\sum_{i \in I} |I|^{-1} - \sum_{i \in I} P(E_i)\Bigr] = c[1 - 1] = 0,
\end{align*}

with equality if and only if $P(E_i)|I| = 1$ for each $i \in I$. □
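
The two bounds of Theorem 2.3.4 are easy to probe numerically. The following sketch is an illustration only, not a substitute for the proof: it draws random probability vectors of a fixed length and compares their entropies with $0$ and $\log|I|$. The helper `random_distribution`, the choice $|I| = 6$, the seed, and the use of base 2 are our own assumptions.

```python
import math
import random

def entropy(probs):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_distribution(n, rng):
    """Hypothetical helper: n non-negative weights, normalised to sum to 1."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 6                   # plays the role of |I|
upper = math.log2(n)

samples = [entropy(random_distribution(n, rng)) for _ in range(1000)]
uniform = entropy([1 / n] * n)                  # equally likely events
degenerate = entropy([1.0] + [0.0] * (n - 1))   # one certain event

print(min(samples), max(samples), upper)
print(uniform, degenerate)
```

Every sampled entropy should land between the two extremes, with the uniform vector attaining the upper bound and the degenerate vector the lower one.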


The following theorem gives a useful connection between conditional entropy
and the set-wise relation between two systems. Notice that if $\mathcal{E}$ is an
amalgamation of $\mathcal{F}$, then whenever you know which event in $\mathcal{F}$ occurred, you
also know which event in $\mathcal{E}$ occurred; i.e., there is no uncertainty regarding $\mathcal{E}$.


2.3.5 Theorem Suppose that $\mathcal{E} = [E_i;\ i \in I]$ and $\mathcal{F} = [F_j;\ j \in J]$ are systems
of events in some finite probability space. Then $H(\mathcal{E} \mid \mathcal{F}) = 0$ if and only if $\mathcal{E}$ is
an amalgamation of $\mathcal{F}$.


Proof: $H(\mathcal{E} \mid \mathcal{F}) = \sum_{i \in I} \sum_{j \in J} P(E_i \cap F_j) \log \frac{P(F_j)}{P(E_i \cap F_j)} = 0 \iff$ for each
$i \in I$, $j \in J$, $P(E_i \cap F_j) \log \frac{P(F_j)}{P(E_i \cap F_j)} = 0$, since the terms of the sum above are
all non-negative.

If $P(F_j) = 0$ then $P(E_i \cap F_j) = P(F_j)$ for any choice of $i \in I$. Since
$P(F_j) = \sum_{i \in I} P(E_i \cap F_j)$ by Corollary 2.2.8, if $P(F_j) > 0$ then $P(E_i \cap F_j) > 0$
for some $i \in I$, and then $P(E_i \cap F_j) \log \frac{P(F_j)}{P(E_i \cap F_j)} = 0$ implies
$P(E_i \cap F_j) = P(F_j)$. Thus $H(\mathcal{E} \mid \mathcal{F}) = 0$ implies that $\mathcal{E}$ is an amalgamation of
$\mathcal{F}$, and the converse is straightforward to see. □
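
A small numerical instance may make Theorem 2.3.5 concrete. In the sketch below, which is our own illustration, a single fair die is rolled; $\mathcal{F}$ consists of the six events “the die shows $j$” and $\mathcal{E}$ = [“odd”, “even”] is an amalgamation of $\mathcal{F}$. The function name and the table layout are assumptions made for the example.

```python
import math
from fractions import Fraction

def conditional_entropy(joint):
    """H(rows | columns) in bits, where joint[i][j] = P(E_i intersect F_j)."""
    col_sums = [sum(row[j] for row in joint) for j in range(len(joint[0]))]
    h = 0.0
    for row in joint:
        for j, p in enumerate(row):
            if p > 0:  # zero-probability intersections contribute nothing
                h += float(p) * math.log2(float(col_sums[j] / p))
    return h

sixth = Fraction(1, 6)
# Row 0 is "odd", row 1 is "even"; column j-1 is "the die shows j".
joint = [[sixth if j % 2 == 1 else 0 for j in range(1, 7)],
         [sixth if j % 2 == 0 else 0 for j in range(1, 7)]]
# Swapping the roles of the two systems amounts to transposing the table.
transposed = [list(col) for col in zip(*joint)]

h_E_given_F = conditional_entropy(joint)       # E is an amalgamation of F
h_F_given_E = conditional_entropy(transposed)  # F is not an amalgamation of E

print(h_E_given_F, h_F_given_E)
```

Knowing the face determines the parity, so the first value vanishes; knowing only the parity leaves three equally likely faces, so the second does not.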


Exercises 2.3 


1. Treating the sets of outcomes as systems of events, write out the entropies
of each of $S_1,\dots,S_4$ in Exercise 2.2.1.


2. In the experiment of $n$ independent Bernoulli trials with probability $p$ of
success on each trial, let $E_k$ = “exactly $k$ successes,” and $\mathcal{E} = [E_0,\dots,E_n]$.
Let $S = \{S, F\}^n$, and treat $S$ as a system of events (i.e., each element of $S$,
regarded as an outcome of the experiment, is also to be thought of as an
event). Write out both $H(\mathcal{E})$ and $H(S)$.


3. For a system $\mathcal{E}$ of events, show that $I(\mathcal{E}, \mathcal{E}) = H(\mathcal{E})$.

4. (a) Show that, if $x_1,\dots,x_n \ge 0$, then

\[ \Bigl(\sum_{i=1}^{n} x_i\Bigr) \log \Bigl(\sum_{i=1}^{n} x_i\Bigr) \ge \sum_{i=1}^{n} x_i \log x_i. \]

(b) Show that if $\tilde{\mathcal{E}}$ is obtained from $\mathcal{E}$ by amalgamation, then
$H(\tilde{\mathcal{E}}) \le H(\mathcal{E})$.


5. Suppose that $\mathcal{E} = [E_i;\ i \in I]$ and $\mathcal{F} = [F_j;\ j \in J]$ are systems of events in
some finite probability space. Under what conditions on $\mathcal{E}$ and $\mathcal{F}$ will it be
the case that $H(\mathcal{E} \mid \mathcal{F}) = H(\mathcal{E})$? [See Theorem 2.4.4, next section.]


6. Suppose that $\mathcal{E}$ and $\mathcal{F}$ are systems of events in probability spaces
associated with two (different) experiments. Suppose that the two experiments
are performed independently, and the set of outcomes of the compound
experiment is identified with $S_1 \times S_2$, where $S_1$ and $S_2$ are the sets of
outcomes for the two experiments separately. Let

\[ \mathcal{E} \cdot \mathcal{F} = [E \times F;\ E \in \mathcal{E},\ F \in \mathcal{F}]. \]

Verify that $\mathcal{E} \cdot \mathcal{F}$ is a system of events in the space of the compound
experiment.

Show that $H(\mathcal{E} \cdot \mathcal{F}) = H(\mathcal{E}) + H(\mathcal{F})$. Will this result hold (necessarily) if
the two experiments are not independent?




2.4 Information and entropy 


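Throughout this section, $\mathcal{E}$ and $\mathcal{F}$ will be systems of events in some finite
probability space.

2.4.1 Theorem $I(\mathcal{E}, \mathcal{F}) = H(\mathcal{E}) + H(\mathcal{F}) - H(\mathcal{E} \wedge \mathcal{F})$.


Proof:

\begin{align*}
I(\mathcal{E}, \mathcal{F}) &= \sum_i \sum_j P(E_i \cap F_j) \log \frac{P(E_i \cap F_j)}{P(E_i) P(F_j)} \\
 &= \sum_i \sum_j P(E_i \cap F_j) \log P(E_i \cap F_j) \\
 &\qquad - \sum_i \sum_j P(E_i \cap F_j) \log P(E_i) - \sum_i \sum_j P(E_i \cap F_j) \log P(F_j) \\
 &= -H(\mathcal{E} \wedge \mathcal{F}) - \sum_i P(E_i) \log P(E_i) \\
 &\qquad - \sum_j P(F_j) \log P(F_j) \quad \text{[using Corollary 2.2.8]} \\
 &= -H(\mathcal{E} \wedge \mathcal{F}) + H(\mathcal{E}) + H(\mathcal{F}). \qquad \Box
\end{align*}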
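2.4.2 Corollary $I(\mathcal{E}, \mathcal{F}) \le H(\mathcal{E}) + H(\mathcal{F})$.


2.4.3 Corollary $H(\mathcal{E} \wedge \mathcal{F}) \le H(\mathcal{E}) + H(\mathcal{F})$, with equality if and only if $\mathcal{E}$ and
$\mathcal{F}$ are statistically independent.


2.4.4 Theorem $H(\mathcal{E} \mid \mathcal{F}) = H(\mathcal{E} \wedge \mathcal{F}) - H(\mathcal{F}) = H(\mathcal{E}) - I(\mathcal{E}, \mathcal{F})$.


2.4.5 Corollary $H(\mathcal{E} \mid \mathcal{F}) \le H(\mathcal{E})$, with equality if and only if $\mathcal{E}$ and $\mathcal{F}$ are
statistically independent.


2.4.6 Corollary $I(\mathcal{E}, \mathcal{F}) \le \min(H(\mathcal{E}), H(\mathcal{F}))$.


Proof: It will suffice to see that $I(\mathcal{E}, \mathcal{F}) \le H(\mathcal{E})$. This follows from Theorem
2.4.4 and the observation that $H(\mathcal{E} \mid \mathcal{F}) \ge 0$. □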


Notice that Corollary 2.4.6 is much stronger than Corollary 2.4.2. 
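
The identities of this section can be checked numerically on any joint table. The sketch below is a sanity check under our own assumptions, not a proof: the table of values $P(E_i \cap F_j)$ is made up, logarithms are taken to base 2, and the variable names are ours.

```python
import math

def H(probs):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities P(E_i intersect F_j); the entries sum to 1.
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.10, 0.20]]

p_E = [sum(row) for row in joint]        # P(E_i), row sums
p_F = [sum(col) for col in zip(*joint)]  # P(F_j), column sums

# Mutual information computed directly from its definition.
mutual_info = sum(p * math.log2(p / (p_E[i] * p_F[j]))
                  for i, row in enumerate(joint)
                  for j, p in enumerate(row) if p > 0)

H_joint = H([p for row in joint for p in row])

# Conditional entropy computed directly from its definition.
H_E_given_F = -sum(p * math.log2(p / p_F[j])
                   for row in joint
                   for j, p in enumerate(row) if p > 0)

print(mutual_info, H(p_E) + H(p_F) - H_joint)       # Theorem 2.4.1
print(H_E_given_F, H_joint - H(p_F))                # Theorem 2.4.4
print(mutual_info <= min(H(p_E), H(p_F)))           # Corollary 2.4.6
```

Each printed pair should agree up to rounding error, whatever valid table is substituted for `joint`.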


Exercises 2.4 
1-4. Prove 2.4.2, 2.4.3, 2.4.4, and 2.4.5, above. 


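5. From 2.4.1 and 2.3.4 deduce necessary and sufficient conditions on $\mathcal{E}$ and
$\mathcal{F}$ for $I(\mathcal{E}, \mathcal{F}) = H(\mathcal{E}) + H(\mathcal{F})$.


6. Express $H(\mathcal{E} \wedge \mathcal{E})$ and $H(\mathcal{E} \mid \mathcal{E})$ as simply as possible.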


7. Three urns, A, B, and C, contain colored balls, as follows: 


A contains three red and five green balls, 
B contains one red and two green balls, and 
C contains seven red and six green balls. 


An urn is chosen, at random, and then a ball is drawn from that urn. Let the
urn names also stand for the event that that urn was chosen, and let $R$ = “a
red ball was chosen,” and $G$ = “a green ball was chosen.” Let $\mathcal{E} = \{A, B, C\}$
and $\mathcal{F} = \{R, G\}$. Write out $I(\mathcal{E}, \mathcal{F})$, $H(\mathcal{E})$, $H(\mathcal{F})$, $H(\mathcal{E} \wedge \mathcal{F})$, $H(\mathcal{E} \mid \mathcal{F})$,
and $H(\mathcal{F} \mid \mathcal{E})$. If, at any stage, you can express whatever you are trying to
express in terms of items already written out, do so.


8. Regarding Corollary 2.4.6: under what conditions on $\mathcal{E}$ and $\mathcal{F}$ is it the
case that $I(\mathcal{E}, \mathcal{F}) = H(\mathcal{E})$?


9. With the urns of problem 7 above, we play a new game. First draw a ball
from urn A; if it is red, draw a ball from urn B; if the ball from urn A
is green, draw a ball from urn C. Let $\mathcal{E}$ and $\mathcal{F}$ be the first and second
stage systems of events for this two-stage experiment; i.e., $\mathcal{E} = [R, G]$ and
$\mathcal{F} = [\tilde{R}, \tilde{G}]$, where, for instance, $R$ = “the first ball drawn was red” and
$\tilde{R}$ = “the second ball drawn was red.”

(a) Write out $I(\mathcal{E}, \mathcal{F})$, $H(\mathcal{E})$, $H(\mathcal{F})$, $H(\mathcal{E} \wedge \mathcal{F})$, $H(\mathcal{E} \mid \mathcal{F})$, and $H(\mathcal{F} \mid \mathcal{E})$
in this new situation.

(b) Suppose now that you are allowed to transfer balls between urns B
and C. How would you rearrange the balls in those urns to maximize
$I(\mathcal{E}, \mathcal{F})$? What is that maximum value?

(c) How would you rearrange the balls in urns B and C to minimize
$I(\mathcal{E}, \mathcal{F})$? What is that minimum value?

(d) Answer the same questions in (b) and (c) with $I(\mathcal{E}, \mathcal{F})$ replaced by
$H(\mathcal{E} \mid \mathcal{F})$.

(e) Under which of the rearrangements you produced in (b), (c), and (d) is
$\mathcal{E}$ an amalgamation of $\mathcal{F}$? Under which is $\mathcal{F}$ an amalgamation of $\mathcal{E}$?
Under which are $\mathcal{E}$ and $\mathcal{F}$ statistically independent?
Channels and Channel Capacity


3.1 Discrete memoryless channels 


A channel is a communication device with two ends, an input end, or trans- 
mitter, and an output end, or receiver. A discrete channel accepts for trans- 
mission the characters of some finite alphabet $A = \{a_1,\dots,a_n\}$, the input
alphabet, and delivers characters of an output alphabet $B = \{b_1,\dots,b_k\}$ to the
receiver. Every time an input character is accepted for transmission, an output 
character subsequently arrives at the receiver. That is, we do not encompass 
situations in which the channel responds to an input character by delivering 
several output characters, or no output. Such situations may be defined out of 
existence: once the input alphabet and the channel are fixed, the output alpha- 
bet is defined to consist of all possible outputs that may result from an input. 
For instance, suppose A = {0, 1}, the binary alphabet, and suppose that it is 
known that the channel is rickety, and may fuzz the input digit so that the re- 
ceiver cannot tell which digit, 0 or 1, is being received, or may stutter and 
deliver two digits, either of which might be fuzzed, upon the transmission of 
a single digit. Then, with $*$ standing for “fuzzy digit,” we are forced to take
$B = \{0, 1, *, 00, 01, 10, 11, 0*, 1*, *0, *1, **\}$.

For a finite alphabet $A$, we let, as convention dictates,

\[ A^\ell = \text{the Cartesian product of $A$ with itself $\ell$ times}
          = \text{the set of words of length $\ell$, over $A$.} \]

Further, let

\[ A^+ = \bigcup_{\ell=1}^{\infty} A^\ell = \text{the set of all (non-empty) words over $A$.} \]


Note that if $A$ is the input alphabet of a channel, then any finite non-empty
subset of $A^+$ could be taken as the input alphabet of the same channel.
Changing the input alphabet in this way will necessitate a change in the output
alphabet. For instance, if $A = \{0, 1\}$, and the corresponding output alphabet is
$B = \{0, 1, *\}$, then if we take $\tilde{A} = \{00, 11\}$, the new output alphabet will be

\[ \tilde{B} = \{00, 01, 0*, 10, 11, 1*, *0, *1, **\}. \]
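
Alphabets of this kind can be generated mechanically. The sketch below is our own illustration, not part of the text: it uses Python's `itertools.product` to form the words of a given length over an alphabet, and the helper name `words` is hypothetical.

```python
from itertools import product

def words(alphabet, length):
    """All words of the given length over the alphabet (the set A^length)."""
    return ["".join(w) for w in product(alphabet, repeat=length)]

B = ["0", "1", "*"]

# Output alphabet when every input letter is a word of length 2.
new_B = words(B, 2)

# Output alphabet for the "rickety" channel that delivers one or two symbols.
stutter_B = words(B, 1) + words(B, 2)

print(new_B)
print(len(stutter_B))
```

The first list reproduces the nine two-symbol outputs, and the second has the twelve elements of the output alphabet of the channel that may fuzz and stutter.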


It is possible to vary the output alphabet by merging or amalgamating
letters; for instance, if $B = \{0, 1, *, x\}$, we could take $\tilde{B} = \{0, 1, a\}$, with $a$
meaning “either $*$ or $x$.” This might be a shrewd simplification if, for instance, the
original letters $*$ and $x$ are different sorts of error indicators, and the distinction
is of no importance.

Another common method of simplifying the output alphabet involves mod- 
ifying the channel by “adding a coin flip.” For instance, if B = {0, 1, *} and you 
really do not want to bother with *, you can flip a coin whenever * is received 
to decide if it will be read as 0 or 1. The coin need not be fair. The same idea 
can be used to shrink B from any finite size down to any smaller size of 2 or 
more. The details of the process depend on the particular situation; they are left 
to the ingenuity of the engineer. See Exercise 3.2.6. 

It may be that there are fundamental input and output alphabets forced upon 
us by the physical nature of the channel; or, as in the case of the telegraph, for 
which the time-hallowed input and output alphabet is {dot, dash} (or, as Shan- 
non [65] has it, {dot, dash, short pause (between letters), long pause (between 
words)}), it may be that some fundamental input alphabet is strongly recom- 
mended, although not forced, by the physical nature of the channel. In the most 
widespread class of examples, the binary channels, the input alphabet has size 
2, and we usually identify the input characters with 0 and 1. Note that, for 
any channel that accepts at least two input characters, we can always confine 
ourselves to two input characters, and thus make the channel binary. 

The telegraph provides a historically fundamental example of a channel; it 
is a somewhat uninteresting, or misleading, example for the student of infor- 
mation theory, because it is so reliable. Over the telegraph, if a “dot” is trans- 
mitted, then a “dot” is received (unless the lines are down), and the same goes 
for “dash.” What makes life interesting in modern times is “channel noise”; 
you cannot be dead certain what the output will be for a given input. Modern 
channels run from outer space, to the ocean floor, to downtown Cleveland—a 
lot can go wrong. Specks of dust momentarily lodge in the receiver, birds fly 
up in front of the transmitter, a storm briefly disrupts the local electromagnetic 
environment—it’s a wonder that successful communication ever takes place. 

We take account of the uncertainty of communication by regarding the at- 
tempt to transmit a single digit as a probabilistic experiment. Before we become 
thoroughly engaged in working out the consequences of this view, it is time to 
announce a blanket assumption that will be in force from here on in this text: 
our channels will all be memoryless. This means that the likelihood of $b_j$ being
the output when $a_i$ is the input does not vary with local conditions, nor with
recent history, for each $i$ and $j$. These unvarying likelihoods are called transition
probabilities and will be discussed in the next section.

Please note that this assumption may well be invalid in a real situation. For
instance, when you hear a “skip” from a record on a turntable,¹ your estimate
of the probability of a skip in the near future changes drastically. You now
estimate that there is a great likelihood of another skip soon, because experience
tells you that these skips occur for an underlying reason, usually a piece of fluff
or lint caught on the phonograph needle. Just so, in a great many situations
wobbles and glitches in the communication occur for some underlying reason
that will not go away for a while, and the assumption of memorylessness is
rendered invalid. What can you do in such situations? There is a good deal of
theory and practice available on the subject of correcting burst errors, as they are
called in some parts. This theory and practice will not, however, be part of this
course. We are calling your attention to the phenomenon of burst errors, and
to the indefensibility of our blanket assumption of memorylessness in certain
situations, just because one of the worst things you can do with mathematics
is to misapply it to situations outside the umbrella of assumption. Probabilistic
assumptions about randomness and independence are very tricky, and the
assumption of memorylessness of a channel is one such.

¹ If you are unacquainted with “records” and “turntables,” ask the nearest elderly
person about them.

This is not the place to go into detail, but let us assure you that you can 
misapply a result about randomly occurring phenomena (such as the glitches, 
skips, and wobbles in transmissions over our memoryless channels are assumed 
to be) to “show,” in a dignified, sincere manner, that the probability that the 
sun will not rise tomorrow is a little greater than 1/3. The moral is that you 
should stare and ponder a bit, to see if your mathematics applies to the situation 
at hand, and if it doesn’t, don’t try to force it. 


Exercises 3.1 
1. (a) Suppose $A = \{0, 1\}$ and $B = \{0, 1, *\}$; suppose we decide to use
$\tilde{A} = \{0000, 1111\}$ as the new input alphabet, for some reason. How large
will the new output alphabet be?
(b) In general, for any input alphabet $A$ and output alphabet $B$, with $|B| = k$,
if we take a new input alphabet $\tilde{A} \subseteq A^\ell$, how many elements will
the new output alphabet have? What will the new output alphabet be?


2. A certain binary channel has the binary alphabet as its output alphabet, as 
well: A = B = {0, 1}. This channel has a memory, albeit a very short one. 
At the start of a transmission, or right after the successful transmission of 
a digit, the probability of a correct transmission is p (regardless of which 
digit, 0 or 1, is being transmitted); right after an error (0 input, 1 output,
or 1 input, 0 output), the probability of a correct transmission is q. (If this
situation were real, we would plausibly have 1/2 < q < p < 1.) In terms 
of p and q, find 


(a) the probability that the string 10001 is received, if 11101 was sent; 
(b) the probability that 10111 was received, if 11101 was sent; 


(c) the probability of exactly two errors in transmitting a binary word of 
length 5; 
(d) the probability of two or fewer errors, in transmitting a binary word of
length n. 


3. Another binary channel has A = B = {0,1}, and no memory; the proba- 
bility of a correct transmission is p, for each digit transmitted. Find the 
probabilities in problem 2, above, for this channel. 


3.2 Transition probabilities and binary symmetric 
channels 


Now we shall begin to work out the consequences of the assumption of
memorylessness. Let the input alphabet be $A = \{a_1,\dots,a_n\}$, and the output alphabet
be $B = \{b_1,\dots,b_k\}$. By the assumption of memorylessness, the probability that
$b_j$ will be received, if $a_i$ is the input character, depends only on $i$, $j$, and the
nature of the channel, not on the weather nor the recent history of the channel.
We denote this probability by $q_{ij}$.

The $q_{ij}$ are called the transition probabilities of the channel, and the $n \times k$
matrix $Q = [q_{ij}]$ is the matrix of transition probabilities. After the input and
output alphabets have been agreed upon, $Q$ depends on the hardware, the
channel itself; or, we could say that $Q$ is a property of the channel. In principle, $q_{ij}$
could be estimated by testing the channel: send $a_i$ many times and record how
often $b_j$ is received. In practice, such testing may be difficult or impossible,
and the $q_{ij}$ are either estimated through theoretical considerations, or remain
hypothetical. Note that $\sum_{j=1}^{k} q_{ij} = 1$, for each $i$; that is, the row sums of $Q$ are
all 1.
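
The testing procedure just described (send $a_i$ many times and record how often $b_j$ is received) can be imitated by simulation. The sketch below is only an illustration under stated assumptions: the matrix `Q` is a made-up example for a channel with two input and three output letters, the function names `send` and `estimate_Q` are ours, and a pseudo-random generator stands in for the physical channel.

```python
import random

def send(i, Q, rng):
    """Transmit input letter a_i once; return the index j of the letter received."""
    return rng.choices(range(len(Q[i])), weights=Q[i])[0]

def estimate_Q(Q, trials, rng):
    """Estimate each q_ij as (number of times b_j was received) / (times a_i was sent)."""
    estimate = []
    for i in range(len(Q)):
        counts = [0] * len(Q[i])
        for _ in range(trials):
            counts[send(i, Q, rng)] += 1
        estimate.append([c / trials for c in counts])
    return estimate

# Hypothetical transition matrix; each row sums to 1.
Q = [[0.85, 0.05, 0.10],
     [0.10, 0.80, 0.10]]

rng = random.Random(1)  # fixed seed so the run is repeatable
Q_hat = estimate_Q(Q, 20000, rng)

for true_row, est_row in zip(Q, Q_hat):
    print(true_row, est_row)
```

Because the channel is memoryless, successive transmissions are independent trials, which is what justifies estimating each $q_{ij}$ by a relative frequency.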

A binary symmetric channel (BSC, for short) is a memoryless channel with
$A = B = \{0, 1\}$ like that described in Exercise 3.1.3; whichever digit, 0 or 1, is
being transmitted, the probability $p$ that it will get through correctly is called
the reliability of the channel. Usually, $1/2 < p < 1$, and we hope $p$ is close to 1.
Letting 0 and 1 index the transition probabilities in the obvious way, the matrix
of transition probabilities for a binary symmetric channel with reliability $p$ is

\[ Q = \begin{bmatrix} q_{00} & q_{01} \\ q_{10} & q_{11} \end{bmatrix}
     = \begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix}. \]

The word “symmetric” in “binary symmetric channel” refers to the symmetry
of $Q$, or to the fact that the channel treats the digits 0 and 1 symmetrically.

Observe that sending any particular binary word of length $n$ through a
binary symmetric channel with reliability $p$ is an instance of $n$ independent
Bernoulli trials, with probability $p$ of Success on each trial (if you count a
correct transmission of a digit as a Success). Thus, the probability of exactly $k$
errors ($n - k$ Successes) in such a transmission is $\binom{n}{k} p^{n-k} (1-p)^k$, and the
average or expected number of errors is $n(1 - p)$.
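
The binomial formula and the expected number of errors can be checked against each other numerically. The following sketch is our own illustration; the word length $n = 8$ and the reliability $p = 0.95$ are arbitrary sample values, and the function names are hypothetical.

```python
from math import comb

def prob_k_errors(n, k, p):
    """P(exactly k errors) = C(n, k) * p^(n-k) * (1-p)^k on a BSC with reliability p."""
    return comb(n, k) * p ** (n - k) * (1 - p) ** k

def expected_errors(n, p):
    """Expected number of errors in a word of length n."""
    return n * (1 - p)

n, p = 8, 0.95  # arbitrary sample values

# Full distribution of the number of errors, k = 0, ..., n.
dist = [prob_k_errors(n, k, p) for k in range(n + 1)]

# The mean of that distribution, computed from the definition of expectation.
mean = sum(k * q for k, q in enumerate(dist))

print(sum(dist))                   # total probability
print(mean, expected_errors(n, p)) # two routes to the expected number of errors
```

The probabilities sum to 1, and the mean computed term by term agrees with $n(1 - p)$ up to rounding.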
Exercises 3.2


1. For a particular memoryless channel we have $A = \{0, 1\}$, $B = \{0, 1, *\}$, and
the channel treats the input digits symmetrically; each digit has probability
$p$ of being transmitted correctly, probability $q$ of being switched to the
other digit, and probability $r$ of being fuzzed, so that the output is $*$. Note
that $p + q + r = 1$.


(a) Give the matrix of transition probabilities, in terms of $p$, $q$, and $r$.

(b) In terms of $n$, $p$, and $k$, what is the probability of exactly $k$ errors
(where an error is either a fuzzed digit or a switched digit) in the
transmission of a binary word of length $n$, over this channel?

(c) Suppose that $*$ is eliminated from the output alphabet by means of a
coin flip, with a fair coin. Whenever $*$ is received, the coin is flipped;
if heads comes up, the $*$ is read as 0, and if tails comes up, it is read
as 1. What is the new matrix of transition probabilities? Is the channel
now binary symmetric?

(d) Suppose that $*$ is eliminated from the output alphabet by merging it
with 1. That is, whenever $*$ is received, it is read as 1 (this amounts to
a coin flip with a very unfair coin). What is the new matrix of transition
probabilities? Is the channel now binary symmetric?


2. A binary symmetric channel has reliability p. 


(a) What is the minimum value of p allowable, if there is to be at least a 
95% chance of no errors at all in the transmission of a binary word of 
length 15? 

(b) Give the inequality that p must satisfy if there is to be at least a 95% 
chance of no more than one error in the transmission of a binary word 
of length 15. For the numerically deft and/or curious: is the minimum 
value of p satisfying this requirement significantly less than the mini- 
mum p satisfying the more stringent requirement in part (a)? 

(c) What is the minimum value of p allowable if the average number of 
errors in transmitting binary words of length 15 is to be no greater than 
1/2? 


3. $A = B = \{0, 1\}$, and the channel is memoryless, but is not a binary
symmetric channel because it treats 0 and 1 differently. The probability is $p_0$ that
0 will be transmitted correctly, and $p_1$ that 1 will be transmitted correctly.

In terms of $p_0$ and $p_1$, find the probabilities described in Exercise 3.1.2 (a)
and (b). Also, if a binary word of length $n$ has $z$ zeros and $n - z$ ones, with
$n \ge 2$, find, in terms of $p_0$, $p_1$, $n$, and $z$, the probability of two or fewer
errors in the transmission of the word.


4. Suppose we decide to take $A^2$ as the new input alphabet. Then $B^2$ will
be the new output alphabet. How will the new transition probabilities
$q_{(i,i')(j,j')}$ be related to the old transition probabilities $q_{ij}$?

5. We have a binary symmetric channel with reliability $p$. We take
$\tilde{A} = \{000, 111\}$ as the new input alphabet. Find the new output alphabet and the new
transition probabilities.


6. Here is a quite general way of modifying the output alphabet of a discrete
channel that includes the idea of “amalgamation” discussed in Section 3.1
and the idea of “amalgamation with a coin flip” broached in Exercise 3.2.1.
We may as well call this method probabilistic amalgamation. Suppose that
$A = \{a_1,\dots,a_n\}$ and $B = \{b_1,\dots,b_k\}$ are the input and output alphabets,
respectively, of a discrete memoryless channel, with transition probabilities
$q_{ij}$. Let $\tilde{B} = \{\beta_1,\dots,\beta_m\}$, $m \ge 2$, be a new (output) alphabet, and let
$u_{jt}$, $j = 1,\dots,k$, $t = 1,\dots,m$, be probabilities satisfying $\sum_{t=1}^{m} u_{jt} = 1$
for each $j = 1,\dots,k$. We make $\tilde{B}$ into the new output alphabet of the
channel by declaring that $b_j$ will be read as $\beta_t$ with probability $u_{jt}$. That
is, whenever $b_j$ is the output letter, a probabilistic experiment is performed
with outcomes $\beta_1,\dots,\beta_m$ and corresponding probabilities $u_{j1},\dots,u_{jm}$ to
determine which of the new output letters will be the output.


(a) In each of Exercises 3.2.1 (c) and (d) identify $\tilde{B}$ and give the matrix
$U = [u_{jt}]$.

(b) In general, supposing that $B$ has been replaced by $\tilde{B}$ as described,
express the new matrix of transition probabilities $\tilde{Q} = [\tilde{q}_{it}]$ for the new
channel with input alphabet $A$ and output alphabet $\tilde{B}$ in terms of the
old matrix of transition probabilities $Q$ and the matrix of probabilities
$U = [u_{jt}]$.

(c) Suppose that $A = \{0, 1\}$, $B = \{0, 1, *\}$, and

\[ Q = \begin{bmatrix} q_{00} & q_{01} & q_{0*} \\ q_{10} & q_{11} & q_{1*} \end{bmatrix}
     = \begin{bmatrix} .9 & .02 & .08 \\ .05 & .88 & .07 \end{bmatrix}. \]

Find a way to probabilistically amalgamate $B$ to $\tilde{B} = A$, so that the
resulting channel is binary symmetric, and $u_{00} = u_{11} = 1$. (That is, find
a $3 \times 2$ matrix $U = [u_{jt}]$ that will do the job.) Is there any other way
(i.e., possibly with $u_{00} \ne 1$ or $u_{11} \ne 1$) to probabilistically amalgamate
$B$ to $A$ to give a BSC with a greater reliability?


3.3 Input frequencies 


As before, we have a memoryless channel with input alphabet $A = \{a_1,\dots,a_n\}$,
output alphabet $B = \{b_1,\dots,b_k\}$, and transition probabilities $q_{ij}$. For
$i \in \{1,\dots,n\}$, let $p_i$ denote the relative frequency of transmission, or input
frequency, of the input character $a_i$. In a large number of situations, it makes sense
to think of $p_i$ as the proportion of the occurrences of $a_i$ in the text (written in
input alphabetic characters) to be transmitted through the channel.

There is a bit of ambiguity here: by “the text to be transmitted” do we mean
some particular segment of input text, or a “typical” segment of input text, or
the totality of all possible input text that ever will or could be transmitted? This
is one of those ambiguities that will never be satisfactorily resolved, we think;
we shall just admit that “the input text” may mean different things on different
occasions. Whatever it means, $p_i$ is to be thought of as the (hypothetical)
probability that a character selected at random from the input text will be $a_i$. This
probability can sometimes be estimated by examining particular segments of 
text. For instance, if you count the number of characters, including punctuation 
marks and blanks, in this text, from the beginning of this section until the end 
of this sentence, and then tally the number of occurrences of the letter ‘e’ in 
the same stretch of text, you will find that ‘e’ accounts for a little less than 1/10 
of all characters; its relative frequency is estimated, by this tally, to be around 
0.096. You can take this as an estimate of the input frequency of the letter ‘e’ 
for any channel accepting the typographical characters of this text as input. This 
estimate is likely to be close to the “true” input frequency of ‘e’, if such there 
be, provided the text segments to be transmitted are not significantly different 
in kind from the sample segment from which 0.096 was derived. On the other 
hand, you might well doubt the validity of this estimate in case the text to be 
transmitted were the translation of “Romeo and Juliet” into Polish. 
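
The tally described above is easy to mechanize. The sketch below is our own illustration: the sample string is a stand-in for a real segment of input text (it is not the stretch of this section referred to above), and the function name `relative_frequencies` is hypothetical.

```python
from collections import Counter

def relative_frequencies(text):
    """Map each character to (number of occurrences) / (total number of characters)."""
    counts = Counter(text)  # blanks and punctuation are counted as characters too
    total = len(text)
    return {ch: c / total for ch, c in counts.items()}

# Stand-in sample text; any segment of input text could be used instead.
sample = "the quick brown fox jumps over the lazy dog"
freq = relative_frequencies(sample)

print(freq["e"])           # estimated input frequency of the letter 'e'
print(sum(freq.values()))  # the estimated frequencies sum to 1
```

As the surrounding discussion warns, such an estimate is only as good as the resemblance between the sample and the text actually transmitted.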

There are situations in which there is a way to estimate the input frequen- 
cies other than by inspecting a segment of input text. For instance, suppose we 
are trying to transmit data by means of a binary code; each datum is represented 
by a binary word, a member of {0, 1}*, and the binary word is input to a binary 
channel. We take A = {0, 1}. Now, the input frequencies $p_0$ and $p_1$, of 0 and 1, 
respectively, will depend on the frequencies with which the various data emerge 
from the data source, and on how these are encoded as binary words. We know, 
in fact we control, the latter, but the former may well be beyond our powers of 
conjecture. If the probabilities of the various data emerging are known a priori, 
and the encoding scheme is agreed upon, then $p_0$ and $p_1$ can be calculated 
straightforwardly (see Exercise 1 at the end of this section). 

Otherwise, when the relative frequencies of the source data are not known 
beforehand, it is a good working rule that different data are to be regarded as 
equally likely. The justification for this rule is ignorance; since probability in 
practice is an a priori assessment of likelihood, in case there is no prior knowledge 
you may as well assess the known possibilities as equally likely. 

We now return to the general case, with $A = \{a_1,\dots,a_n\}$ and $a_i$ having 
input frequency $p_i$. Observe that $\sum_{i=1}^{n} p_i = 1$. Also, note that the probabilities 
$p_i$ have nothing to do with the channel; they depend on how we use the input 
alphabet to form text. They are therefore manageable, in principle; we feel that 
if we know enough about what is to be transmitted, we can make arrangements 
(in the encoding of the messages to be sent) so that the input frequencies of 
$a_1,\dots,a_n$ are as close as desired to any prescribed values $p_1,\dots,p_n \ge 0$ satisfying 
$\sum_{i=1}^{n} p_i = 1$. The practical difficulties involved in approaching prescribed 
input frequencies are part of the next chapter’s subject. For now, we will ignore 
those difficulties and consider the $p_i$ to be variables; we pretend to be able to 
vary them, within the constraints that $p_i \ge 0$, $i = 1,\dots,n$, and $\sum_i p_i = 1$. In 
this respect the $p_i$ are quite different from the $q_{ij}$, about which we can do nothing; 
the transition probabilities are constant parameters, forced upon us by the 
choice of channel. 

We now focus on the act of attempting to transmit a single input character. 
We regard this as a two-stage experiment. The first stage: selecting some $a_i$ for 
transmission. The second stage: observing which $b_j$ emerges at the receiving 
end of the channel. We take, as the set of outcomes, 

$$S = \{(a_i, b_j);\ i \in \{1,\dots,n\},\ j \in \{1,\dots,k\}\},$$

in which $(a_i, b_j)$ is short for “$a_i$ was selected for transmission, and $b_j$ was 
received.” We commit further semantic atrocities in the interest of brevity: $a_i$ 
will stand for the event 

$$\{(a_i, b_1),\dots,(a_i, b_k)\} = \text{“$a_i$ was selected for transmission,”}$$

as well as standing for the $i$th input character; similarly $b_j$ will sometimes denote 
the event “$b_j$ was received.” Readers will have to be alert to the context, in 
order to divine what means what. For instance, in the sentence “$P(a_i) = p_i$,” it 
is evident that $a_i$ stands for an event, not a letter. 


3.3.1 With P denoting the probability assignment to S, and noting the abbre- 
viations introduced above, it seems that we are given the following: 


(i) $P(a_i) = p_i$, and 
(ii) $P(b_j \mid a_i) = q_{ij}$, whence 
(iii) $P(a_i, b_j) = P(a_i \cap b_j) = p_i q_{ij}$, and 
(iv) $P(b_j) = \sum_{t=1}^{n} p_t q_{tj}$. 

The probabilities $P(b_j)$ in (iv) are called the output frequencies of $b_j$, $j = 1,\dots,k$. 
It is readily checked that $P(b_j) \ge 0$ and $\sum_{j=1}^{k} P(b_j) = 1$. 
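
Formula (iv) is a vector-matrix product, and can be sketched as follows. The channel below is hypothetical, chosen only to have rows that sum to 1; the function name `output_frequencies` is likewise an assumption of this example.

```python
def output_frequencies(p, Q):
    """P(b_j) = sum over t of p[t] * Q[t][j], as in 3.3.1(iv)."""
    n, k = len(Q), len(Q[0])
    return [sum(p[t] * Q[t][j] for t in range(n)) for j in range(k)]

# Hypothetical input frequencies and transition probabilities
p = [0.5, 0.5]
Q = [[0.9, 0.1],
     [0.2, 0.8]]
out = output_frequencies(p, Q)   # one output frequency per output letter
```

Because each row of Q sums to 1 and the input frequencies sum to 1, the output frequencies sum to 1 as well.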

Now, the careful and skeptical reader will, we hope, experience a shiver of 
doubt in thinking all of this over. Putting aside qualms about memorylessness 
and the invariability of the $q_{ij}$, there is still an infelicity in the correspondence 
between the “model” and “reality” in this two-stage experiment view of trans- 
mission of a single character, and the problem is in the first stage. In order to 
assert (i), above, we must view the process of “selecting an input character” 
as similar to drawing a ball from an urn; we envision a large urn, containing 
balls colored $a_1,\dots,a_n$, with proportion $p_i$ of them colored $a_i$, $i = 1,\dots,n$. 
Attempting to transmit a string of input symbols means successively drawing 
balls from this urn, with replacement and remixing after each draw; this is what 
our “model” says we are up to. 

The problem is that this does not seem much like what we actually do when 
dealing with input text. If you are at some point in the text, it doesn’t seem that 
the next character pops up at random from an urn; it seems that the probabilities 
of the various characters appearing next in the text ought to be affected by where 
we are in the text, by what has gone before. This is certainly the way it is with 
natural languages; for instance, in English, ‘u’ almost always follows ‘q’ and 
‘b’ rarely follows ‘z’. Thus, for the English-speaking reader, the “draw from 
an urn” model of “selecting the next character for transmission” breaks down 
badly, when the input text is in English. 

Notice also the situation of Exercise 3.3.1. Because of the way the source 
messages are encoded, it seems intuitively obvious that whenever a 0 is input, 
the probability of the next letter for transmission being 0 is greater than $p_0$, the 
relative frequency of 0 in the input text. (And that is, in fact, the case. You 
might verify that, after a 0, assuming we know nothing else of what has been 
transmitted already, the probability that the next letter will be 0 is 17/24, while 
$p_0 < 1/2$.) 
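
The verification suggested above can be sketched with exact fractions. The sketch uses the message frequencies and code words of Exercise 3.3.1 and assumes that successive messages are emitted independently and their code words simply concatenated; the variable names are inventions of this example.

```python
from fractions import Fraction

# Code words and source frequencies from Exercise 3.3.1
source = {"11111": Fraction(3, 10),
          "100001": Fraction(1, 2),
          "1100": Fraction(1, 5)}

# Expected number of 0's, and of all letters, per message
zeros = sum(f * w.count("0") for w, f in source.items())
letters = sum(f * len(w) for w, f in source.items())
p0 = zeros / letters

# Every code word begins with 1, so a 0 that ends a code word is always
# followed by 1; only "00" pairs inside a single code word need counting.
zero_then_zero = sum(f * sum(1 for i in range(len(w) - 1) if w[i:i + 2] == "00")
                     for w, f in source.items())
p_zero_after_zero = zero_then_zero / zeros
```

Under these assumptions `p_zero_after_zero` is the conditional probability referred to in the text, and `p0` is the overall input frequency of 0.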

Nevertheless, we shall hold to the simplifying assumption that $p_i$, the proportion 
of $a_i$’s in the input text, is also the probability that the next letter is $a_i$, 
at any point in the input stream. This assumption is valid if we are ignorant of 
grammar and spelling in the input language; we are again, as with the transition 
probabilities, in the weird position of bringing a probability into existence by 
assuming ignorance. In the case of the transition probabilities, that assumption 
of ignorance is usually truthful; in the present case, it is more often for con- 
venience, because it is difficult to take into account what we know of the input 
language. There are ways to analyze information transfer through discrete chan- 
nels with account taken of grammar and/or spelling in the input language—see 
Shannon’s paper [65], and the discussion in Chapter 7 of this text. We shall not 
burden the reader here with that more difficult analysis, but content ourselves 
with a crude but useful simplification, in this introduction to the subject. 


Exercises 3.3 


1. Suppose a data, or message, source gives off, from time to time, any one 
of three data, or messages, $M_1$, $M_2$, and $M_3$. $M_1$ accounts for 30% of all 
emanations from the source, $M_2$ for 50%, and $M_3$ for 20%. 


These messages are to be transmitted using a binary channel. To this end, 
$M_1$ is encoded as 11111, $M_2$ as 100001, and $M_3$ as 1100. Find $p_0$ and $p_1$, 
the input frequencies of 0 and 1, respectively, into the channel to be used 
for this task. 


[Hint: suppose that a large number $N$ of messages are in line to be transmitted, 
with $3N/10$ of them instances of $M_1$, $N/2$ of them $M_2$, and $N/5$ 
of them $M_3$. Count up the number of 0’s and the number of 1’s in the 
corresponding input text.] 


2. Same question as in Exercise 1 above, except that nothing is known about 
the relative frequencies of $M_1$, $M_2$, and $M_3$; apply the convention of assuming 
that $M_1$, $M_2$, and $M_3$ are equally likely. 


3. A binary symmetric channel with reliability $p$ is used in a particular communication 
task for which the input frequencies of 0 and 1 are $p_0 = 2/3$ 
and $p_1 = 1/3$. Find the output frequencies of 0 and 1 in terms of $p$. [Hint: 
apply 3.3.1(iv).] 

4. Let $A = \{a_1, a_2, a_3\}$, $B = \{b_1, b_2, b_3\}$, 

$$Q = \begin{bmatrix} .94 & .04 & .02 \\ .01 & .93 & .06 \\ .03 & .04 & .93 \end{bmatrix},$$

$p_1 = .4$, $p_2 = .5$, and $p_3 = .1$. Find the output frequencies, $P(b_1)$, $P(b_2)$, 
and $P(b_3)$. 


5. Suppose, in using the channel of the preceding problem, there is a cost as- 
sociated with each attempted transmission of a single input letter. Suppose 
the $(i, j)$-entry of the following matrix gives the cost, to the user, of $b_j$ 
being received when $a_i$ was sent, in some monetary units: 

$$C = \begin{bmatrix} 0 & 5 & 9 \\ 10 & 0 & 2 \\ 4 & 2 & 0 \end{bmatrix}$$

(a) Express, in terms of $p_1$, $p_2$, and $p_3$, the average cost per transmission-
of-a-single-input-letter of using this channel. Evaluate when $p_1 = .4$, 
$p_2 = .5$, and $p_3 = .1$. 

(b) What choice of p1, p2, p3 minimizes the average cost-per-use of this 
channel? Would the user be wise to aim to minimize that average cost? 




3.4 Channel capacity 


$A$, $B$, the $q_{ij}$, and the $p_i$ will be as in the preceding section. With the $a_i$ standing for 
events, not characters, $A = \{a_1,\dots,a_n\}$ is a system of events in the probability 
space associated with the two-stage experiment of sending a single character 
through a memoryless channel with input alphabet $A$ and output alphabet $B$. 
Observe that we have taken on yet another risk of misunderstanding; $A$ will 
sometimes be an alphabet, sometimes a system of events, and you must infer 
which from the context. When a system of events, $A$ is the input system of 
events for the channel with input alphabet $A$. Similarly, $B = \{b_1,\dots,b_k\}$ will 
sometimes stand for a system of events called the output system. 

We are interested in communication, the transfer of information; it is reasonable 
to suppose that we ought, therefore, to be interested in the mutual information 
between the input and output systems, 

$$
\begin{aligned}
I(A,B) &= \sum_{i=1}^{n}\sum_{j=1}^{k} P(a_i \cap b_j)\log\frac{P(a_i \cap b_j)}{P(a_i)P(b_j)} \\
&= \sum_{i=1}^{n}\sum_{j=1}^{k} p_i q_{ij}\log\frac{p_i q_{ij}}{p_i\sum_{t=1}^{n} p_t q_{tj}} \\
&= \sum_{i=1}^{n}\sum_{j=1}^{k} p_i q_{ij}\log\frac{q_{ij}}{\sum_{t=1}^{n} p_t q_{tj}}.
\end{aligned}
$$

$I(A,B)$ is a function of the variables $p_1,\dots,p_n$, the input frequencies. It would 
be interesting to know the maximum value that $I(A,B)$ can have. That maximum 
value is called the capacity of the channel, and any values of $p_1,\dots,p_n$ 
for which that value is achieved are called optimal input (or transmission) frequencies 
for the channel. If you accept $I(A,B)$ as an index, or measure, of the 
potential effectiveness of communication attempts using this channel, then the 
capacity is the fragile acme of effectiveness. This peak is achieved by optimally 
adjusting the only quantities within our power to adjust, once the hardware has 
been established and the input alphabet has been agreed to, namely, the input 
frequencies. The main result of this section will show how to find the optimal 
input frequencies (in principle). But before launching into the technical details, 
let us muse a while on the meaning of what it is that we are optimizing. 
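
As a concrete companion to the formula for $I(A,B)$, here is a minimal Python sketch that evaluates it for given input frequencies and transition probabilities. It assumes logarithms to base 2 (so the result is in bits), and it skips terms with $p_i q_{ij} = 0$, which contribute nothing to the sum; the function name and the two test channels are assumptions of this example.

```python
from math import log2

def mutual_information(p, Q):
    """I(A,B) in bits, from input frequencies p and transition matrix Q."""
    n, k = len(Q), len(Q[0])
    out = [sum(p[t] * Q[t][j] for t in range(n)) for j in range(k)]  # P(b_j)
    total = 0.0
    for i in range(n):
        for j in range(k):
            if p[i] > 0 and Q[i][j] > 0:      # zero terms are left out
                total += p[i] * Q[i][j] * log2(Q[i][j] / out[j])
    return total

perfect = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical noiseless binary channel
useless = [[0.5, 0.5], [0.5, 0.5]]   # output independent of input
i_perfect = mutual_information([0.5, 0.5], perfect)
i_useless = mutual_information([0.5, 0.5], useless)
```

The two extreme channels bracket the possibilities for binary input: one bit per letter when nothing is lost, and nothing at all when the output is independent of the input.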


3.4.1 Shannon’s interpretation of I(A, B) as rate of information transfer or 
flow. Suppose that input letters are arriving at the transmitter at the rate of r 
letters per second. The average information content of an input letter is H(A); 
therefore, since the average of a sum is the sum of the averages, there are, on 
average, r H(A) units of information per second arriving at the transmitter. The 
information flow is mussed up a bit by the channel; at what average rate is 
information “flowing” through the channel? 

C. E. Shannon’s answer [63,65]: at the rate $rI(A,B) = r(H(A) - H(A \mid B))$. 
This answer becomes plausible if you bear down on the interpretation of 
$H(A \mid B)$ as a measure of the average uncertainty of the input letter, conditional 
upon knowing the output letter. Shannon calls H(A | B) the “average ambiguity 
of the received signal,” or “the equivocation,” and this last terminology has 
taken root. Note that “the equivocation” is not dependent on the channel alone, 
but also on the input frequencies. In Shannon’s interpretation, it is the amount 
of information removed, on average, by the channel from the input stream, per 
input letter. 

The validity of this interpretation is bolstered by the role the equivocation 
plays in Shannon’s Noisy Channel Theorem, which we will encounter later. For 
right now, here is an elementary example due to Shannon himself. 

Let the base of log be 2, so the units of information are bits. Suppose we 
have a binary symmetric channel with reliability .99, and the input is streaming 
into the receiver at the rate of 1000 symbols (binary digits) per second, with input 
frequencies $p_0 = p_1 = 1/2$. These are, we shall see soon, optimal, and give 
$I(A,B) = .99\log 1.98 + .01\log .02 \approx .919$. By the interpretation of $I(A,B)$ 
under consideration, this says that information is flowing to the receiver at the 
average rate of 1000(.919) = 919 bits per second. 

You might object that, on average, 990 of the 1000 digits arriving at the 
receiver each second are correct (i.e., equal to the digit transmitted), so perhaps 
990 bits/second ought to be the average rate of information flow to the receiver. 
Shannon points out that by that reasoning, if the reliability of the channel were 
1/2, i.e., if the channel were perfectly useless, you would compute that informa- 
tion flow to the receiver at 500 bits/second, on average, whereas the true rate of 
information flow in this case ought to be zero. The problem is, whether p = 1/2 
or p = .99, we do not know which of the 1000p correct digits (on average, each 
second) are correct; our uncertainty in this regard means that our estimate of 
the rate of information flow to the receiver ought to be revised downward from 
1000p bits/sec. (Verify: $p\log_2 2p + (1-p)\log_2 2(1-p) < p$, $1/2 \le p < 1$.) 
Why this particular revision, from 990 down to 919 bits/sec? This is where 
$H(A \mid B) = -(.01\log .01 + .99\log .99) \approx .081$ comes in; supposing you know 
which letter, 0 or 1, is received, H(A | B) is the entropy, i.e., average uncer- 
tainty, of the input letter (system), so it is a good measure of the amount of 
information to be subtracted from one (the number of bits just received) due to 
uncertainty about what was sent. (Convinced? Feel uncertain about something? 
Well, that’s entropy, and it’s good for you, taken in moderation.) 
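
The two numbers in Shannon's example are easy to reproduce. The following sketch, with base-2 logarithms and variable names chosen for this illustration, computes $I(A,B)$ and the equivocation $H(A \mid B)$ for reliability .99 and equal input frequencies.

```python
from math import log2

p = 0.99        # reliability of the binary symmetric channel
rate = 1000     # input symbols per second

# I(A,B) for p0 = p1 = 1/2
info_per_letter = p * log2(2 * p) + (1 - p) * log2(2 * (1 - p))

# Equivocation H(A | B) for the same channel and input frequencies
equivocation = -(p * log2(p) + (1 - p) * log2(1 - p))

bits_per_second = rate * info_per_letter
```

Since $H(A) = 1$ bit when the two inputs are equally likely, the two quantities add up to 1, which is the subtraction "from one" described in the text.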

It is preferable to speak of $I(A,B)$ as the average information flow through 
the channel, or flow to the receiver, per input letter, rather than as the average 
amount of information arriving at the receiver (per input letter). The latter 
might reasonably be taken to be H(B), which is, indeed, the average amount 
of information contained in the set of outcomes of the probabilistic experiment 
of “choosing” an input letter and then attempting to transmit it, if we were to 
take B as the set of outcomes; and taking B as the set of outcomes does seem to 
respond to the question of how much information is arriving at the receiver, per 
input letter. But $H(B)$ as a measure of information has no connection with how 
well the channel is communicating the input stream. For instance, for a BSC 
with reliability 1/2, $H(B) = \log 2$, while surely the level of communication 
ought to be $0 = I(A,B)$. 

Use of the word “flow” in this context will aid in understanding the Noisy 
Channel Theorem, in Section 4.6. That theorem discloses a remarkable analogy 
between information flowing through a channel and fluid flowing through a 
pipe. 


3.4.2 Supposing the transition probabilities $q_{ij}$ are known, finding the optimal 
input frequencies for, and thus the capacity of, a given channel is a straightforward 
multi-variable optimization problem; we wish to find where $I(A,B)$, as a 
function of $p_1,\dots,p_n$, achieves its maximum on 

$$K_n = \Big\{(p_1,\dots,p_n) \in \mathbb{R}^n;\ p_1,\dots,p_n \ge 0 \text{ and } \sum_{i=1}^{n} p_i = 1\Big\}.$$


By convention, the terms in the sum for $I(A,B)$ corresponding to pairs 
$(i, j)$ for which $q_{ij} = 0$ do not actually appear in that sum. Note that if $p_t > 0$, 
$t = 1,\dots,n$, and $\sum_{t=1}^{n} p_t q_{tj} = 0$, then $q_{tj} = 0$, $t = 1,\dots,n$. It follows that 
the formula for $I(A,B)$ defines a differentiable function in the positive part 
of $\mathbb{R}^n$, $\{(p_1,\dots,p_n) \in \mathbb{R}^n;\ p_t > 0,\ t = 1,\dots,n\}$. Consequently, the Lagrange 
Multiplier Theorem asserts that if $I(A,B)$ achieves a maximum on $K_n$ in 
$K_n^+ = \{(p_1,\dots,p_n) \in K_n;\ p_i > 0,\ i = 1,\dots,n\}$, then the maximum is necessarily 
achieved at a point where 

$$\frac{\partial}{\partial p_s}\Big(I(A,B) - \lambda\sum_{i=1}^{n} p_i\Big) = 0,\quad s = 1,\dots,n,$$

for some $\lambda$. 


The main content of Theorem 3.4.3, below, is that a sort of converse of this 
statement holds: if the equations arising from the Lagrange Multiplier Theorem 
hold at a point $(p_1,\dots,p_n) \in K_n^+$, then $I(A,B)$ necessarily achieves a maximum, 
on $K_n$, at $(p_1,\dots,p_n)$. The proof of this statement is a bit technical, and 
is relegated to the next section, which is optional; although it is preferable that 
even students of applied mathematics understand the theoretical foundations of 
their subject, in this case it probably won’t overly imperil your immortal soul to 
accept the result without proof. 


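Let us see where the Lagrange Multiplier method tells us to look for the 
optimal input frequencies. Setting $F(p_1,\dots,p_n) = I(A,B) - \lambda\sum_{i=1}^{n} p_i$, 
considering only points $(p_1,\dots,p_n)$ where all coordinates are positive, and setting 
$c = \log(e)$, we have 

$$
\begin{aligned}
\frac{\partial F}{\partial p_s} &= \frac{\partial}{\partial p_s}\big(I(A,B)\big) - \lambda \\
&= \sum_{j=1}^{k} q_{sj}\log\frac{q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}} - c\sum_{i=1}^{n} p_i\sum_{j=1}^{k}\frac{q_{ij}\,q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}} - \lambda \\
&= \sum_{j=1}^{k} q_{sj}\log\frac{q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}} - c\sum_{j=1}^{k} q_{sj}\,\frac{\sum_{i=1}^{n} p_i q_{ij}}{\sum_{t=1}^{n} p_t q_{tj}} - \lambda \\
&= \sum_{j=1}^{k} q_{sj}\log\frac{q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}} - c\sum_{j=1}^{k} q_{sj} - \lambda \\
&= \sum_{j=1}^{k} q_{sj}\log\frac{q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}} - (c + \lambda).
\end{aligned}
$$

Replacing $c + \lambda$ by $C$, and setting the partial derivative equal to 0, we obtain the 
capacity equations for the channel. 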


3.4.3 Theorem Suppose a memoryless channel has input alphabet $A = \{a_1,\dots,a_n\}$, 
output alphabet $B = \{b_1,\dots,b_k\}$, and transition probabilities $q_{ij}$, $i \in 
\{1,\dots,n\}$, $j \in \{1,\dots,k\}$. There are optimal input frequencies for this channel. 
If $p_1,\dots,p_n$ are positive real numbers, then $p_1,\dots,p_n$ are optimal input 
frequencies for this channel if and only if $p_1,\dots,p_n$ satisfy the following, for 
some value of $C$: 

$$\sum_{i=1}^{n} p_i = 1 \quad\text{and}\quad \sum_{j=1}^{k} q_{sj}\log\frac{q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}} = C,\quad s = 1,\dots,n.$$

Furthermore, if $p_1,\dots,p_n$ are optimal input frequencies satisfying these equations, 
for some value of $C$, then $C$ is the channel capacity. 
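
To make the capacity equations concrete, the sketch below evaluates their left-hand sides for a given positive $(p_1,\dots,p_n)$ and checks whether they all agree. It is only a numerical check under stated assumptions (base-2 logarithms, a tolerance `tol`, and helper names invented here), not a method for solving the equations.

```python
from math import log2

def capacity_lhs(p, Q):
    """Left-hand side of the capacity equation for each input index s."""
    n, k = len(Q), len(Q[0])
    out = [sum(p[t] * Q[t][j] for t in range(n)) for j in range(k)]
    return [sum(Q[s][j] * log2(Q[s][j] / out[j])
                for j in range(k) if Q[s][j] > 0)
            for s in range(n)]

def satisfies_capacity_equations(p, Q, tol=1e-9):
    """True if p is positive, sums to 1, and all left-hand sides agree."""
    if min(p) <= 0 or abs(sum(p) - 1) > tol:
        return False
    lhs = capacity_lhs(p, Q)
    return max(lhs) - min(lhs) < tol

# Hypothetical binary symmetric channel of reliability 0.9
Q = [[0.9, 0.1],
     [0.1, 0.9]]
ok_uniform = satisfies_capacity_equations([0.5, 0.5], Q)
ok_skewed = satisfies_capacity_equations([0.6, 0.4], Q)
```

When the check succeeds, the common value of the left-hand sides is the $C$ of the theorem.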


This theorem may seem, at first glance, to be saying that all you have to do 
to find the capacity of a channel and the optimal input frequencies is to solve 
the capacity equations of the channel, the equations arising from the Lagrange 
Multiplier Theorem, and the condition $\sum_{i=1}^{n} p_i = 1$, for $p_1,\dots,p_n > 0$. There 
is a loophole, however, a possibility that slips through a crack in the wording 
of the theorem: it is possible that the capacity equations have no solution. See 
problems 9 and 14 at the end of this section. Note that in problem 9, it is not 
just that the equations have no solution $(p_1,\dots,p_n)$ with all the $p_i$ positive; the 
equations have no solution, period. 

From Theorem 3.4.3 you can infer that this unpleasant phenomenon, the capacity 
equations having no solution, occurs only when the capacity is achieved 
at points $(p_1,\dots,p_n) \in K_n$ with one or more of the $p_i$ equal to zero. If $p_i = 0$, 
then $a_i$ is never used; we have thrown away an input character; we are not using 
all the tricks at our disposal. Problems 9 and 14 show that it can, indeed, happen 
that there are input characters that we are better off without. Note, however, the 
result of problem 10, in which the channel quite severely mangles and bullies 
one of the input letters, $a_n$, while maintaining seamlessly perfect respect of the 
others; yet, in the optimal input frequencies, $p_n$ is positive, which shows that 
we are better off using $a_n$ than leaving it out, in spite of how terribly the channel 
treats it (provided we accept $I(A,B)$ as a measure of how well off we are). 
In this respect, note also the results of exercise problems 2, 6 (a special case 
of problem 10 when p = 1/2), and 7. The practical moral to be drawn from 
these examples seems to be that if the channel respects an input character even 
a little bit, if you occasionally get some information from the output (upon in- 
putting this character) about the input, then you are better off with the character 
than without it. The surprising result of Exercise 14 obliterates this tentative 
conclusion, and shows that we may be in the presence of a mystery. 

How will we know when we are in the rare necessity of banishing one or 
more input characters, and what do we do about determining the optimal input 
frequencies in such cases? According to Theorem 3.4.3, we are in such a case 
when and only when the capacity equations of the channel have no solution in 
$K_n^+$. In such a situation, the $n$-tuple $(p_1,\dots,p_n)$ of optimal input frequencies 
lies on one of the faces of $K_n$, $F_R = \{(p_1,\dots,p_n) \in K_n;\ p_i > 0 \text{ for } i \in R \text{ and } 
p_i = 0 \text{ for } i \notin R\}$, where $R$ is a proper subset of $\{1,\dots,n\}$. For such an $R$, let 
$A_R = \{a_i \in A;\ i \in R\}$, the input alphabet obtained by deleting the $a_i$ indexed by 
indices not in $R$. Finding $(p_1,\dots,p_n)$ on $F_R$ amounts to solving the channel 
capacity problem with $A$ replaced by $A_R$; if $(p_1,\dots,p_n) \in F_R$ is the $n$-tuple of 
optimal input frequencies, then the non-zero $p_i$, those indexed by $i \in R$, will 
satisfy the capacity equations associated with this modified problem. (These 
equations are obtainable from the original capacity equations by omitting those 
$p_i$ and $q_{ij}$ with $i \notin R$.) 

Thus, if the capacity equations for the channel have no solution $(p_1,\dots,p_n)$ 
with $p_i > 0$, $i = 1,\dots,n$, we need merely solve the $2^n - n - 2$ systems of 
capacity equations associated with the $A_R$, for $R$ satisfying $2 \le |R| \le n-1$. 
It is a consequence of Theorem 3.4.3 that we may first consider all $A_R$ with 
$|R| = n-1$, and from among the various solutions select one for which the corresponding 
capacity is maximal. If there are no solutions, move on to $A_R$ with 
$|R| = n-2$, and so on. All of this is straightforward, but it is also a great deal of 
trouble; we hope that in most real situations the optimal input frequencies will 
be all positive. 
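
The order of search described above, largest index sets first, can be sketched as follows. The sketch only enumerates the sets $R$ and forms the reduced transition matrix for $A_R$; solving the capacity equations on each face is left aside. The names `proper_faces` and `restrict`, and the 3-input matrix, are hypothetical.

```python
from itertools import combinations

def proper_faces(n):
    """Yield index sets R (1-based) with 2 <= |R| <= n-1, largest first."""
    for size in range(n - 1, 1, -1):
        for R in combinations(range(1, n + 1), size):
            yield R

def restrict(Q, R):
    """Transition matrix for the reduced input alphabet A_R."""
    return [Q[i - 1] for i in R]

# Hypothetical 3-input channel, used only to show the bookkeeping
Q = [[0.8, 0.2],
     [0.5, 0.5],
     [0.2, 0.8]]
faces = list(proper_faces(3))       # the index sets to be tried, in order
Q_first_face = restrict(Q, faces[0])
```

The number of index sets produced is $2^n - n - 2$, in agreement with the count in the text: all subsets except the empty set, the $n$ singletons, and the full set.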


3.4.4 As mentioned above, the proof of the main assertion of 3.4.3 is postponed 
until the next section, the last of this chapter. However, we can give the proof of 
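the last assertion here. If $p_1,\dots,p_n$ satisfy the equations above, then the value 
of $I(A,B)$ at $(p_1,\dots,p_n)$ is 

$$I(A,B) = \sum_{i=1}^{n} p_i\sum_{j=1}^{k} q_{ij}\log\frac{q_{ij}}{\sum_{t=1}^{n} p_t q_{tj}} = C\sum_{i=1}^{n} p_i = C.$$

To remember the capacity equations, other than $\sum_{i=1}^{n} p_i = 1$, it is helpful 
to remember that the left-hand side of 

$$\sum_{j=1}^{k} q_{sj}\log\frac{q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}} = C$$

is the thing multiplying $p_s$ in the formula for 

$$I(A,B) = \sum_{s=1}^{n} p_s\sum_{j=1}^{k} q_{sj}\log\frac{q_{sj}}{\sum_{t=1}^{n} p_t q_{tj}}.$$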


3.4.5 The capacity of a binary symmetric channel. Suppose a binary symmetric 
channel has reliability $p$. Let $p_0$, $p_1$ denote the input frequencies of 0 and 1, 
respectively. The capacity equations are: 

$$
\begin{aligned}
&(1)\quad p_0 + p_1 = 1,\\
&(2)\quad p\log\frac{p}{p_0 p + p_1(1-p)} + (1-p)\log\frac{1-p}{p_0(1-p) + p_1 p} = C,\\
&(3)\quad (1-p)\log\frac{1-p}{p_0 p + p_1(1-p)} + p\log\frac{p}{p_0(1-p) + p_1 p} = C.
\end{aligned}
$$

Setting the left-hand sides of (2) and (3) equal, and canceling $p\log p$ and 
$(1-p)\log(1-p)$, we obtain 

$$
\begin{aligned}
&p\log\big(p_0 p + p_1(1-p)\big) + (1-p)\log\big(p_0(1-p) + p_1 p\big) \\
&\quad = (1-p)\log\big(p_0 p + p_1(1-p)\big) + p\log\big(p_0(1-p) + p_1 p\big),
\end{aligned}
$$

whence 

$$(2p-1)\log\big(p_0 p + p_1(1-p)\big) = (2p-1)\log\big(p_0(1-p) + p_1 p\big),$$

so either $p = 1/2$ or 

$$p_0 p + p_1(1-p) = p_0(1-p) + p_1 p,$$

$$(2p-1)p_0 = (2p-1)p_1,$$

so $p_0 = p_1 = 1/2$ (in view of (1)) if $p \ne 1/2$, and the channel capacity is 
$C = p\log 2p + (1-p)\log 2(1-p)$, obtainable by plugging $p_0 = p_1 = 1/2$ 
into either (2) or (3) above. 

If $p = 1/2$, then, since 

$$\frac{p}{p_0 p + p_1(1-p)} = \frac{1-p}{p_0(1-p) + p_1 p} = 1$$

for all values of $p_0, p_1$ satisfying $p_0 + p_1 = 1$, in this case, we have $I(A,B) = 0$ 
for all values of $p_0, p_1$. This is as it should be, since when $p = 1/2$, sending 
a digit through this channel is like flipping a fair coin. We learn nothing about 
the input by examining the output, the input and output systems are statistically 
independent, the channel is worthless for communication. 

Note that it is not obvious, a priori, that $p\log 2p + (1-p)\log 2(1-p)$ is 
positive for all values of $p \in [0, 1] \setminus \{1/2\}$, but that this is the case follows from 
Theorem 2.2.13. 

The foregoing shows that when $p \ne 1/2$, $p_0 = p_1 = 1/2$ are the unique optimal 
input frequencies of a binary symmetric channel of reliability $p$. If we had 
wished only to verify that $p_0 = p_1 = 1/2$ are optimal (i.e., if the uniqueness is 
of no interest), then we could have saved ourselves some trouble, and found the 
capacity, by simply noting that $p_0 = p_1 = 1/2$ satisfy (1) and make the left-hand 
sides of (2) and (3) equal. The optimality of $p_0 = p_1 = 1/2$, and the expression 
for C, then follow from Theorem 3.4.3. For a generalization of this observation, 
see 3.4.7, below. 
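
As a numerical sanity check on 3.4.5, the sketch below evaluates $I(A,B)$ for a binary symmetric channel over a grid of values of $p_0$ and compares the best grid point with the closed form for $C$. The reliability 0.9, the grid spacing, and the function name are assumptions of this illustration; base-2 logarithms are used.

```python
from math import log2

def bsc_information(p0, p):
    """I(A,B) in bits for a BSC of reliability p and input frequencies (p0, 1-p0)."""
    p1 = 1 - p0
    d0 = p0 * p + p1 * (1 - p)          # P(0 is received)
    d1 = p0 * (1 - p) + p1 * p          # P(1 is received)
    terms = [(p0 * p, p / d0), (p0 * (1 - p), (1 - p) / d1),
             (p1 * (1 - p), (1 - p) / d0), (p1 * p, p / d1)]
    return sum(w * log2(r) for w, r in terms if w > 0)

p = 0.9                                           # hypothetical reliability
grid = [i / 1000 for i in range(1, 1000)]         # interior values of p0
best_p0 = max(grid, key=lambda x: bsc_information(x, p))
capacity = p * log2(2 * p) + (1 - p) * log2(2 * (1 - p))
```

A grid search proves nothing, of course, but it agrees with the derivation: the largest value on the grid occurs at $p_0 = 1/2$ and equals the closed-form capacity.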


3.4.6 Here are two questions of possible practical importance that are related, 
and to which the answers we have are incomplete: 


(i) When (under what conditions on Q) are the optimal input frequencies of 
a channel unique? 


(ii) Do the optimal input frequencies of a channel depend continuously on 
the transition probabilities of the channel? 


Regarding (i), the only instances we know of when the optimal input fre- 
quencies are not unique are when the capacity of the channel is zero. (Certainly, 
in this case, any input frequencies will be optimal; but the remarkable thing is 
that it is only in this case that we have encountered non-unique optimal input 
frequencies.) We hesitantly conjecture that if the channel capacity is non-zero, 
then the optimal input frequencies are unique. For those interested, perhaps the 
proof in Section 3.5 will reward study. 

Regarding (ii), there is a body of knowledge related to the Implicit Func- 
tion Theorem in the calculus of functions of several variables that provides an 
answer of sorts. Regarding the left-hand sides of the capacity equations as functions 
of both the $p_i$ and $q_{ij}$, supposing there is a solution of the equations at 
positive $p_i$, $i = 1,\dots,n$, and supposing that a certain large matrix of partial 
derivatives has maximum rank, then for every small wiggle of the $q_{ij}$ there will 
be a positive solution of the new capacity equations quite close to the solution 
of the original system. When will that certain large matrix of partial derivatives 
fail to have maximum rank? We can’t tell you exactly, but the short answer is: 
almost never. Thus, the answer to (ii) is: yes, except possibly in certain rare 
pathological circumstances that we haven’t worked out yet. 

Here is an example illustrating the possible implications and uses of the 
continuous dependence of the optimal input frequencies on the transition probabilities. 
Suppose that $A = \{0, 1\}$, $B = \{0, 1, *\}$, and 

$$Q = \begin{bmatrix} q_{00} & q_{01} & q_{0*} \\ q_{10} & q_{11} & q_{1*} \end{bmatrix} = \begin{bmatrix} .93 & .02 & .05 \\ .01 & .95 & .04 \end{bmatrix}.$$

Then $Q$ is “close” to $\tilde{Q} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$, which is the matrix of transition 
probabilities of a BSC. (For the channel associated with $\tilde{Q}$, $*$ has been removed as an 
output letter.) Therefore the optimal input frequencies of the original channel 
are “close” to $p_0 = p_1 = 1/2$, and the channel capacity is “close” to $\log 2$. 
Caution: there is a risk involved in rough estimation of this sort. For instance, 
would you say that the matrix of transition probabilities in Exercise 3.4.14 is 
“close” to 

$$\begin{bmatrix} 1/2 & 1/4 & 1/4 \\ 1/4 & 1/2 & 1/4 \\ 1/4 & 1/4 & 1/2 \end{bmatrix}?$$

If you are in a reckless mood, you might well 
do so, yet the optimal input frequencies for the channel with the latter matrix of 
transition probabilities are 1/3, 1/3, 1/3 (this will be shown below), while the 
optimal input frequencies for the channel of problem 14 are 1/2, 0, 1/2. Disconcerting 
discrepancies of this sort should chasten our fudging and make us 
appreciate numerical error analysis of functions of several variables. But we 
will pursue this matter no further in this text. 
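
How “close” the first example really is can be examined numerically. The sketch below estimates the optimal $p_0$ and the capacity of the channel with the matrix $Q$ of that example by a simple grid search; the grid spacing and function name are assumptions, logarithms are to base 2 (so $\log 2$ is 1 bit), and a grid search yields only an estimate.

```python
from math import log2

def mutual_information(p, Q):
    """I(A,B) in bits; terms with zero probability are skipped."""
    n, k = len(Q), len(Q[0])
    out = [sum(p[t] * Q[t][j] for t in range(n)) for j in range(k)]
    return sum(p[i] * Q[i][j] * log2(Q[i][j] / out[j])
               for i in range(n) for j in range(k)
               if p[i] > 0 and Q[i][j] > 0)

# Transition probabilities from the example in 3.4.6
Q = [[0.93, 0.02, 0.05],
     [0.01, 0.95, 0.04]]

grid = [i / 1000 for i in range(1, 1000)]
best_p0 = max(grid, key=lambda x: mutual_information([x, 1 - x], Q))
capacity_estimate = mutual_information([best_p0, 1 - best_p0], Q)
```

Comparing `best_p0` with 1/2 and `capacity_estimate` with 1 bit gives a concrete sense of how much slack the word “close” is carrying here.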


3.4.7 n-ary symmetric channels An $n$-ary symmetric channel of reliability $p$ 
is a discrete memoryless channel with 

$$A = B \quad\text{and}\quad Q = \begin{bmatrix} p & & \frac{1-p}{n-1} \\ & \ddots & \\ \frac{1-p}{n-1} & & p \end{bmatrix};$$

that is, the main diagonal entries of $Q$ are all the same (namely, $p$), and the 
off-diagonal entries of $Q$ are all the same. (Their common value will have to be 
$\frac{1-p}{n-1}$ if the row sums are to be 1.) 

It is straightforward to verify that $p_1 = \cdots = p_n = 1/n$ satisfy the capacity 
equations of such a channel, with 

$$
\begin{aligned}
C &= p\log np + (1-p)\log\frac{n(1-p)}{n-1} \\
&= \log n + p\log p + (1-p)\log\frac{1-p}{n-1},
\end{aligned}
$$

so by Theorem 3.4.3, $(1/n,\dots,1/n)$ are optimal input frequencies for the channel 
and the capacity is $C$, above. These optimal input frequencies and this capacity 
are also discoverable by the method explained in the exercise section, 
after Exercise 3.4.12, and this method has the advantage that by it and the application 
of a little linear algebra theory, it can easily be seen that $p_i = 1/n$, 
$i = 1,\dots,n$, are unique optimal input frequencies except in the case $p = 1/n$, 
which is precisely the case $C = 0$. 
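
The verification for the $n$-ary symmetric channel can be sketched numerically. With uniform input frequencies every output frequency is $1/n$, so the left-hand side of the $s$th capacity equation reduces to $\sum_j q_{sj}\log(n\,q_{sj})$. The values $n = 4$ and $p = 0.7$ and the function names are assumptions of this example; logarithms are to base 2.

```python
from math import log2

def nary_symmetric_matrix(n, p):
    """Q with p on the main diagonal and (1-p)/(n-1) elsewhere."""
    off = (1 - p) / (n - 1)
    return [[p if i == j else off for j in range(n)] for i in range(n)]

def nary_capacity(n, p):
    """C = log n + p log p + (1-p) log((1-p)/(n-1)), in bits, for 0 < p < 1."""
    return log2(n) + p * log2(p) + (1 - p) * log2((1 - p) / (n - 1))

n, p = 4, 0.7                      # hypothetical channel parameters
Q = nary_symmetric_matrix(n, p)

# Left-hand sides of the capacity equations at p_1 = ... = p_n = 1/n
lhs = [sum(q * log2(n * q) for q in row if q > 0) for row in Q]
```

Every entry of `lhs` equals `nary_capacity(n, p)`, and the closed form vanishes at $p = 1/n$, in agreement with the remark that $p = 1/n$ is precisely the case $C = 0$.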


Exercises 3.4 
1. Verify directly that $f(p) = p\log 2p + (1-p)\log 2(1-p)$ achieves its maximum, 
$\log 2$, on $[0, 1]$ at the endpoints, 0 and 1, and its minimum, 0, at 1/2. 


2. Verify that the value of $I(A,B)$ at the extreme points $\{(1,0,\dots,0), (0,1,0,\dots,0), 
\dots, (0,\dots,0,1)\}$ of $K_n$ is zero. 


3. Suppose A = B = {0, 1}, but the channel is not symmetric; suppose a trans- 
mitted 0 has probability p of being received as 0, and a transmitted 1 has 
probability $q$ of being received as 1. Let $p_0$ and $p_1$ denote the input frequencies. 
In terms of $p$, $q$, $p_0$, and $p_1$, write $I(A,B)$, and give the capacity 
equations for this channel. 


4. Give I(A, B) and the capacity equations for the channel described in Ex- 
ercise 3.3.4. 


5. A= {0,1}, B = {0, 1, *}, and the channel treats the input characters sym- 
metrically; for each input, 0 or 1, the probability that it will be received as 
sent is p, the probability that it will be received as the other digit is q, and 
the probability that it will be received as $*$ is $r$. Note that $p + q + r = 1$. 


Find, in terms of $p$, $q$, and $r$, the capacity of this channel and the optimal 
input frequencies. 


6. A = B = {a,b}; a is always transmitted correctly; when b is transmitted, 
the probability is $p$ that $b$ will be received (and, thus, $1 - p$ that $a$ will be 
received). Find, in terms of p, the capacity of this channel and the optimal 
input frequencies. Verify that even when p = 1/2 (a condition of maximum 
disrespect for the input letter b), the capacity is positive (which is greater 
than the capacity would be if the letter b were discarded as an input letter— 
see Exercise 2, above). 


7. [Part of this exercise was lifted from [37].] A = B = {a,b,c}, a is always 
transmitted correctly, and the channel behaves symmetrically with respect 
to $b$ and $c$. Each has probability $p$ of being transmitted correctly, and probability 
$1 - p$ of being received as the other character ($c$ or $b$). (Thus, if $a$ is 
received, it is certain that a was sent.) 


(a) Find the capacity of this channel, and the optimal input frequencies, as 
functions of p. 

(b) Suppose that c is omitted from the input alphabet (but not the output 
alphabet). Find the capacity of the channel and the optimal input fre- 
quencies in this new situation. 

(c) Are there any values of p for which the capacity found in (b) is greater 
than that in (a)? What about the case p = 1/2? 


8. Suppose that $A = B = \{a_1,\dots,a_n\}$, and the channel is perfectly reliable: 
when $a_i$ is sent, $a_i$ is certain to be received. Find the capacity of this channel 
and the optimal input frequencies. 


9. Suppose that $A = \{a_1,\dots,a_{n+1}\}$, $B = \{a_1,\dots,a_n\}$, and the channel respects 
$a_1,\dots,a_n$ perfectly; when $a_i$ is sent, $a_i$ is certain to be received, $1 \le i \le n$. 


(a) Suppose that when $a_{n+1}$ is sent, the output characters $a_1,\dots,a_n$ are 
equally likely to be received. Show that the capacity equations for the 
channel have no solution in this case. Find the optimal input frequencies 
and the capacity of this channel. 

(b) Are there any transition probabilities $q_{n+1,j}$, $j = 1,\dots,n$, for which 
there are optimal input frequencies $p_1,\dots,p_{n+1}$ for this channel with 
$p_{n+1} > 0$? If so, find them, and find the corresponding optimal input 
frequencies and the channel capacity. 


10. Suppose that $n \ge 2$, $A = \{a_1,\dots,a_n\} = B$, and the channel respects
$a_1,\dots,a_{n-1}$ perfectly. Suppose that, when $a_n$ is sent, the output characters
$a_1,\dots,a_n$ are equally likely to be received. Find the optimal input frequencies
and the capacity of this channel.


11. We have a binary symmetric channel with reliability $p$, but we take $A =
\{000, 111\}$. Let the input frequencies be denoted $p_0$ and $p_1$. In terms of
$p$, $p_0$, and $p_1$, write the mutual information between inputs and outputs,
and the capacity equations of this channel. Assuming that $p_0 = p_1 = 1/2$
are the optimal input frequencies, write the capacity of this channel as a
function of $p$.


12. (a) Show that $I(A, B) \le H(A)$. (This is a special case of a result in
Section 2.4.)


(b) Show that $I(A, B) = H(A)$ if and only if for each letter $b_j$ received,
there is exactly one input letter $a_i$ such that $P(a_i \mid b_j) = 1$ (so
$P(a_k \mid b_j) = 0$ for $k \ne i$). [Hint: recall that $H(A \mid B) = H(A) - I(A, B)$; use
Theorem 2.3.5 or its proof.] In other words, $I(A, B) = H(A)$ if and
only if the input is determinable with certainty from the output. In yet
other words, $I(A, B) = H(A)$ if and only if the input system of events
is an amalgamation of the output system.


For exercises 13-15, we are indebted to Luc Teirlinck, who observed that
\[ I(A, B) = H(B) - H(B \mid A), \]
so that if
\[ -H(B \mid A) = \sum_{i=1}^{n} p_i \sum_{j=1}^{k} q_{ij} \log q_{ij} \quad \text{[verify!]} \]
does not depend on $p_1,\dots,p_n$, as it will not if the sums
$S_i = \sum_{j=1}^{k} q_{ij} \log q_{ij}$ are all the same, $i = 1,\dots,n$, then $I(A, B)$ is
maximized when $H(B)$ is. The obvious way to maximize $H(B)$ is to “make”
$P(b_j) = \sum_{t=1}^{n} p_t q_{tj}$ equal to $1/k$, $j = 1,\dots,k$. Thus, in these cases, the
optimal input frequencies $p_1,\dots,p_n$ might be found by solving the linear system

\[ \begin{aligned} p_1 + \cdots + p_n &= 1, \\ \sum_{t=1}^{n} p_t q_{tj} &= \frac{1}{k}, \quad j = 1,\dots,k. \end{aligned} \]

[The first equation is redundant: to see this, sum the $k$ equations just above over
$j$.] This method is not certain to succeed because the solutions of this linear
system may fail to be non-negative, or may fail to exist.

Notice that the sums $S_i$ will be all the same if each row of $Q$ is a
rearrangement of the first row.
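The identity $I(A, B) = H(B) - H(B \mid A)$ used above can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function name `mutual_information`, the use of base-2 logarithms, and the binary symmetric channel with reliability 0.9 are all assumptions chosen for illustration.

```python
from math import log2


def mutual_information(p, Q):
    """Return I(A, B) = H(B) - H(B | A) in bits.

    p[i] is the input frequency of a_i, and Q[i][j] is the probability
    that b_j is received when a_i is sent (hypothetical argument layout).
    """
    n, k = len(Q), len(Q[0])
    # Output probabilities P(b_j) = sum over t of p_t * q_tj.
    out = [sum(p[t] * Q[t][j] for t in range(n)) for j in range(k)]
    h_b = -sum(x * log2(x) for x in out if x > 0)
    # -H(B | A) = sum_i p_i * S_i, where S_i = sum_j q_ij log q_ij.
    neg_h_b_given_a = sum(
        p[i] * sum(q * log2(q) for q in Q[i] if q > 0) for i in range(n)
    )
    return h_b + neg_h_b_given_a


# Assumed example: each row of Q is a rearrangement of the first row,
# so the sums S_i agree and maximizing H(B) maximizes I(A, B).
Q = [[0.9, 0.1], [0.1, 0.9]]
uniform = mutual_information([0.5, 0.5], Q)
skewed = mutual_information([0.7, 0.3], Q)
print(uniform, skewed)
```

In this assumed example the uniform input makes the output probabilities equal to 1/2 each, and it gives a larger mutual information than the skewed input.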


13. Find the optimal input frequencies when

\[ Q = \begin{pmatrix} 2/3 & 0 & 1/3 \\ 1/3 & 2/3 & 0 \\ 0 & 1/3 & 2/3 \end{pmatrix}. \]

Also, find the capacity of the channel.


14. Find the optimal input frequencies and the channel capacity, when

\[ Q = \begin{pmatrix} 1/2 & 1/3 & 1/6 \\ 1/6 & 1/2 & 1/3 \\ 1/6 & 1/3 & 1/2 \end{pmatrix}. \]


15. Suppose that $n \ge 3$, $0 < p < 1$, and

\[ Q = \begin{pmatrix}
p & 0 & \cdots & 0 & 1-p \\
0 & p & \cdots & 0 & 1-p \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & p & 1-p \\
1-p & 0 & \cdots & 0 & p
\end{pmatrix}. \]

(a) For which values of $p$ does the method of solving a linear system give
the optimal input frequencies for this channel?

*(b) What are the optimal input frequencies and the channel capacity, in
terms of $p$, in all cases?


Exercises 14 and 15 are instructive for those interested in the problem of 
getting conditions on Q under which the optimal input frequencies are 
unique and positive. 


*16. Suppose a channel has input alphabet $A$, output alphabet $B$, and capacity $C$.
Suppose we take $A^k$ as the new input alphabet. Show that the new capacity
is $kC$. (This result is a theorem in [81]. You may find the results of 2.4
helpful, as well as the result of exercise 2.3.6.)




3.5* Proof of Theorem 3.4.3, on the capacity equations 


By the remarks of the preceding section, what remains to be shown is that (i)
$I(A, B)$ does achieve a maximum on $K_n$, and (ii) if the capacity equations are
satisfied, for some $C$, by some $p_1,\dots,p_n > 0$, then $p_1,\dots,p_n$ are optimal input
frequencies for the channel.

Since $K_n$ is closed and bounded, to prove (i) it suffices to show that $I(A, B)$
is continuous on $K_n$. This may seem trivial, since $I(A, B)$ appears to be given
by a formula involving only linear functions of $p_1,\dots,p_n$ and log, but please
note that this formula is valid at points $(p_1,\dots,p_n) \in K_n \setminus K_n^+$ only by
convention; there is trouble when one or more of the $p_i$ is zero. Still, the verification
that $I(A, B)$ is continuous at such points is straightforward, and is left to the
reader to sort out. Keep in mind that $x \log x \to 0$ as $x \to 0^+$. See problem 1 at
the end of this section.

A real-valued function $f$ defined on a convex subset $K$ of $\mathbb{R}^n$ is said to be
concave if

\[ f(tu + (1-t)v) \ge t f(u) + (1-t) f(v) \quad \text{for all } u, v \in K,\ t \in [0, 1]. \]

If strict inequality holds whenever $u \ne v$ and $t \in (0,1)$, we will say that $f$ is
strictly concave.

We shall now list some facts about concave functions to be used to finish 
the proof of Theorem 3.4.3. Proofs of these facts are omitted. It is recom- 
mended that the reader try to supply the proofs. Notice that 3.5.3 and 3.5.4, 
taken together, constitute the well-known “second derivative test” for concavity 
and relative maxima of functions of one variable. 


3.5.1 Any sum of concave functions is concave, and if one of the summands is 
strictly concave, then the sum is strictly concave. A positive constant times a 
(strictly) concave function is (strictly) concave.


3.5.2 Any linear function is concave, and the composition of a linear function
with a concave function of one variable is concave.


3.5.3 If $I \subseteq \mathbb{R}$ is an interval, $f : I \to \mathbb{R}$ is continuous, and $f'' \le 0$ on the
interior of $I$, then $f$ is concave on $I$. If $f'' < 0$ on the interior of $I$, then $f$ is
strictly concave on $I$.


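3.5.4 If $I \subseteq \mathbb{R}$ is an interval, $f : I \to \mathbb{R}$ is concave on $I$, and $f'(x_0) = 0$ for
some $x_0 \in I$, then $f$ achieves a maximum on $I$ at $x_0$. If $f$ is strictly concave on
$I$ and $f'(x_0) = 0$, then $f$ achieves a maximum on $I$ only at $x_0$.


Now we are ready to finish the proof of Theorem 3.4.3. Let

\[ f(x) = \begin{cases} -x \log x, & x > 0, \\ 0, & x = 0. \end{cases} \]

By 3.5.3, $f$ is strictly concave on $[0, \infty)$. Now,

\[ \begin{aligned}
I(A, B) &= \sum_{i=1}^{n} p_i \Big( \sum_{j=1}^{k} q_{ij} \log q_{ij} \Big)
 - \sum_{j=1}^{k} \Big( \sum_{t=1}^{n} p_t q_{tj} \Big) \log \Big( \sum_{t=1}^{n} p_t q_{tj} \Big) \\
&= \sum_{i=1}^{n} p_i \Big( \sum_{j=1}^{k} q_{ij} \log q_{ij} \Big)
 + \sum_{j=1}^{k} f\Big( \sum_{t=1}^{n} p_t q_{tj} \Big),
\end{aligned} \]

so by 3.5.1 and 3.5.2, $I(A, B)$ is a concave function on $K_n$. It is evident that
$K_n$ is convex.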


If the capacity equations are satisfied, for some $C$, at a point $(p_1,\dots,p_n)$
with $p_1,\dots,p_n > 0$, then $(p_1,\dots,p_n) \in K_n^+$ and the gradient of $I(A, B)$ at
$(p_1,\dots,p_n)$ is

\[ \nabla I(A, B)\big|_{(p_1,\dots,p_n)} = (C - \log e,\ C - \log e,\ \dots,\ C - \log e). \]

That is, the gradient of $I(A, B)$ at $(p_1,\dots,p_n)$ is a scalar multiple of $(1,\dots,1)$,
which is normal to the hyperplane with equation $x_1 + \cdots + x_n = 1$, in $\mathbb{R}^n$, of
which $K_n$ is a fragment. It follows that the directional derivative of $I(A, B)$, at
$(p_1,\dots,p_n)$, in any direction parallel to this hyperplane, is zero. It follows that
the function of one variable obtained by restricting $I(A, B)$ to any line segment
in $K_n$ through $(p_1,\dots,p_n)$ will have derivative zero at the value of the one
variable corresponding to the point $(p_1,\dots,p_n)$. It follows that $I(A, B)$ achieves
its maximum on each such line segment at $(p_1,\dots,p_n)$, by 3.5.4. Therefore,
$I(A, B)$ achieves its maximum on $K_n$ at $(p_1,\dots,p_n)$.


Exercises 3.5


1. Suppose that $(p_1,\dots,p_{n-1},0) \in K_n$, and $p_1,\dots,p_{n-1} > 0$. Show that

\[ I(A, B)\big|_{(\tilde p_1,\dots,\tilde p_n)} \to I(A, B)\big|_{(p_1,\dots,p_{n-1},0)} \]

as $(\tilde p_1,\dots,\tilde p_n) \to (p_1,\dots,p_{n-1},0)$, with $(\tilde p_1,\dots,\tilde p_n) \in K_n$. [You may
assume that, for each $j \in \{1,\dots,k\}$, $q_{ij} > 0$ for some $i \in \{1,\dots,n\}$.
(Interpretation?) You may as well inspect the functions $f_{ij}(p_1,\dots,p_n) =
p_i q_{ij} \log(\sum_{t=1}^{n} p_t q_{tj})$. No problem when $q_{ij} = 0$, and no problem when
$1 \le i \le n-1$. When $i = n$, you will need to consider two cases: $q_{1j} =
\cdots = q_{n-1,j} = 0$, and otherwise.]


*2. Under what conditions on the transition probabilities is $I(A, B)$ strictly
concave on $K_n$?


3. Prove the statements in 3.5.1 and 3.5.2.


Coding Theory


4.1 Encoding and decoding 


The situation is this: we have a source alphabet $S = \{s_1,\dots,s_m\}$ and a code
alphabet $A = \{a_1,\dots,a_n\}$, which is also the input alphabet of some channel.
We would like to transmit text written in the source alphabet, but our channel 
accepts only code alphabetic characters. Therefore, we aim to associate a code 
alphabet word to represent each source alphabet word that we might wish to 
send. 

In many real situations, it is not really necessary to represent each member
of $S^+$, the set of all source words, by a code word, a member of $A^+$. For
instance, if the source text is a chunk of ordinary English prose, we can be
reasonably certain that we will not have to transmit nonsense words like “zrdfle”
or “cccm.” However, it does not seem that there is any great advantage to be had
by omitting part of $S^+$ from consideration, and there is some disadvantage: the
discussion gets complicated, quarrels break out, anxieties flourish.


Definitions An encoding function is a function $\phi : S^+ \to A^+$. We say that such
a function defines, or determines, a code. The code determined by $\phi$ is said to
be unambiguous if and only if $\phi$ is one-to-one (injective). Otherwise, the code
is ambiguous.

A valid decoder-recognizer (VDR) for the code determined by $\phi$ is an
algorithm which accepts as input any $w \in A^+$, and produces as output either the
message “does not represent a source word” if, indeed, $w$ is not in the range of
$\phi$, or, if $w \in \mathrm{ran}\,\phi$, some $v \in S^+$ such that $\phi(v) = w$.

The code determined by $\phi$ is uniquely decodable if and only if it is
unambiguous and there exists a VDR for it.


Some remarks are in order. 


4.1.1 Note that the definitions above do not really say what a code is. It is 
something determined by an encoding function, but what? It might be more 
satisfying logically to identify the code with the encoding function which deter- 
mines it, but, unfortunately, that would lead to syntactic constructions that clash 
with common usage. The definition above stands without apology, but the uses
of the word “code” may increase in the future. 


4.1.2 A VDR is, as its name indicates, an algorithm that either correctly de- 
codes a code word, or correctly recognizes that the code word cannot be de- 
coded. 

We shall be quite informal about describing VDRs, and extremely cavalier 
about proving that a given algorithm is a VDR for a given code. For instance, 
suppose that $S = A = \{0, 1\}$, and that $\phi$ is described by: $\phi$ doubles each 0 and
leaves 1 as is. [Thus, for instance, $\phi(1010) = 100100$.] Then the following
describes a VDR for this code: given $v \in \{0,1\}^+$, scan $v$, and if any maximal
block of consecutive 0’s of odd length is found in v, report “does not represent a 
source word”; otherwise, halve each maximal block of consecutive 0’s in v, and 
output the resulting word w. We leave it to the reader to divine what is meant 
by “maximal block of consecutive 0’s.” 

The point is that we do not make a fuss about how you scan or how you 
find a maximal block of consecutive 0’s in v and determine its length. Any 
implementation of the algorithm described would have to handle these and other 
matters, but the details are not our concern here. Also, it is possible to prove 
formally that this algorithm is a VDR for the given code, and that the code is 
uniquely decodable, but a bit of thought will convince anybody that these things 
are true, so that writing out formal proofs becomes an empty exercise, as well as 
being no fun. It can be of practical value to attempt proofs of algorithm validity 
and unique decodability, especially when these matters are in doubt, but we 
shall not be at all conscientious about such proofs. 
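
The VDR described in 4.1.2 can be sketched in a few lines of Python. This is one possible implementation under stated assumptions, not the text's own: the names `encode` and `vdr` are hypothetical, words are represented as non-empty strings over the characters 0 and 1, and the standard `re` module is used to find the maximal blocks of consecutive 0's.

```python
import re


def encode(v):
    """The encoding function of 4.1.2: double each 0, leave each 1 as is."""
    return v.replace("0", "00")


def vdr(w):
    """Return the source word encoded by w, or None if there is none.

    None plays the role of the report "does not represent a source word".
    """
    # re.findall with the pattern 0+ returns the maximal blocks of 0's.
    for block in re.findall(r"0+", w):
        if len(block) % 2 == 1:
            return None
    # Halve each maximal block of consecutive 0's.
    return re.sub(r"0+", lambda m: "0" * (len(m.group()) // 2), w)


print(encode("1010"))
print(vdr("100100"))
print(vdr("1000"))
```

The first call reproduces the example from the text, the second recovers 1010, and the third reports failure because 1000 contains a block of three 0's.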


4.1.3 In modern naive set theory, it is proven that for any non-empty S and A, 
there are uncountably many injective functions from $S^+$ into $A^+$. The codes
determined by two different such functions cannot have the same VDR. It is 
also proven that there are but countably many algorithms expressible in any 
natural language. It follows that there are quite a few, in fact, uncountably 
many codes with no VDR. We certainly want nothing to do with such codes, 
but don’t worry—there is very little danger of encountering such a code. 


In most of the codes actually in use in real life, the encoding function is 
defined in a particularly straightforward way. 


Definition An encoding scheme for a source alphabet $S = \{s_1,\dots,s_m\}$ in terms
of a code alphabet $A$ is a list of productions,

\[ \begin{array}{l} s_1 \to w_1 \\ \quad\vdots \\ s_m \to w_m, \end{array} \]

in which $w_1,\dots,w_m \in A^+$. For short, we will say that such a list is a scheme
for $S \to A$.

Each encoding scheme gives rise to an encoding function $\phi : S^+ \to A^+$
by concatenation. The concatenation of a sequence of words is just the word
obtained by writing them down in order, with no separating spaces, commas, or
other marks. Given an encoding scheme, as above, and a word $v \in S^+$, we let
$\phi(v)$ be the concatenation of the sequence of the $w_i$, $1 \le i \le m$, corresponding,
according to the scheme, to the source letters occurring in $v$. For example, if
$S = \{a,b,c\}$, $A = \{0, 1\}$, and the scheme is

\[ \begin{array}{l} a \to 01 \\ b \to 10 \\ c \to 111, \end{array} \]

then $\phi(acbba) = 01111101001$.
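
Encoding by concatenation is short enough to write out directly. In the sketch below, representing the scheme as a Python dictionary and the name `encode` are assumptions made for illustration; the scheme itself is the one from the example above.

```python
def encode(scheme, v):
    """Concatenate the code words that the scheme assigns to the letters of v."""
    return "".join(scheme[letter] for letter in v)


# The scheme from the example: a -> 01, b -> 10, c -> 111.
scheme = {"a": "01", "b": "10", "c": "111"}
print(encode(scheme, "acbba"))  # prints 01111101001
```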

It is sometimes useful to be more formal; given an encoding scheme, we
could define the corresponding encoding function by induction on the length of
the source word. For $v \in S^+$, let $\mathrm{lgth}(v)$ stand for the length of $v$, the number of
letters appearing in $v$. If $\mathrm{lgth}(v) = 1$, then $v \in S$, so $v = s_i$ for some $i$, and we set
$\phi(v) = w_i$, where $w_i$ is the code word on the right-hand side of the production
$s_i \to w_i$ in the scheme. If $\mathrm{lgth}(v) > 1$, then $v = u s_i$ for some $s_i \in S$, and some
$u \in S^+$ with $\mathrm{lgth}(u) = \mathrm{lgth}(v) - 1$; $\phi(u)$ has already been defined, so we set
$\phi(v) = \phi(u) w_i$.

The formality of this definition of $\phi$ is unnecessary for most purposes, but
it is advisable to keep it in mind. It provides a form for proving by induction
statements about the code determined by $\phi$. Sometimes the other obvious
inductive definition of $\phi$, in which source words are formed by adding letters on
the left rather than on the right, is more convenient.


4.1.4 When an encoding scheme is given, and thereby an encoding function, 
the term “the code” can refer to (i) the encoding scheme, (ii) the list $w_1,\dots,w_m$
of code words appearing in the scheme, or (iii) the set $\{w_1,\dots,w_m\}$.


4.1.5 Theorem Every code determined by an encoding scheme has a VDR. 


Proof: Given a scheme $s_i \to w_i$, $i = 1,\dots,m$, and a word $w \in A^+$, look among
all concatenations of the $w_i$, with length of the concatenation equal to $\mathrm{lgth}(w)$.
[There are surely systematic ways to go about forming all such concatenations,
but it would be tiresome to dwell upon those ways here.] If none of them
match $w$, report “does not represent a source word.” If one of them matches $w$,
decode in the obvious way, by replacing each $w_i$ in the concatenation by some
corresponding $s_i$ in the encoding scheme. It is left to you to convince yourself
that this prescription constitutes a VDR for the given code. □


The algorithm plan described above is a very bad one, extremely slow and 
inefficient, and should never be used. It is of interest only because it works 
whatever the encoding scheme. 


4.1.6 Given an encoding scheme $s_i \to w_i$, $i = 1,\dots,m$, reading-left-to-right
with reference to this scheme is the following algorithm: given $w \in A^+$, scan
from left to right along $w$ until you recognize some $w_j$ as an initial segment
of $w$. If the $w_j$ you recognize is also $w_i$ for some $i \ne j$ (i.e., if the same
code word represents different source letters according to the encoding scheme,
heaven forfend), then decide for which $i$ you have recognized $w_i$ by some rule;
for instance, you could let $i$ be the smallest of the eligible indices.

If no $w_j$ has been recognized as an initial segment of $w$, after scanning over
$\max_{1 \le j \le m} \mathrm{lgth}(w_j)$ letters of $w$, or if you come to the end of $w$ after scanning
fewer letters, without recognizing some $w_j$, report “does not represent a source
word.” Otherwise, having recognized $w_j$, jot down $s_j$ on your decoder pad,
to the right of any source letters already recorded, peel (delete) the segment $w_j$
from $w$, and begin the process anew, with the smaller word replacing $w$. If there
is nothing left after $w_j$ is peeled from $w$, stop, and declare that the decoding is
complete. Reading-right-to-left is described similarly.

For example, suppose that S = {a,b,c}, A = {0, 1}, and the scheme is 


\[ \begin{array}{l} a \to 0 \\ b \to 01 \\ c \to 001. \end{array} \]


(This is a particularly stupid scheme, for all practical purposes. Note that 001 
represents both ab and c in the code defined by this scheme.) With reference 
to this scheme, what will be the outcome of applying the reading-left-to-right 
algorithm to 001? Answer: the output will be “does not represent a source 
word.” (Surprised?) It follows that reading-left-to-right is not a VDR for this 
code. However, reading-right-to-left is a VDR for this code. Verification, or 
proof, of this assertion is left to you. (Take a look at Exercise 4.2.2.) 
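
The reading-left-to-right algorithm of 4.1.6 can be sketched as follows. This is an illustrative Python rendering under assumptions not found in the text: the scheme is a list of (source letter, code word) pairs in production order, `None` stands for the report “does not represent a source word,” and reading-right-to-left is obtained by reversing every word, which is one reasonable reading of “described similarly.”

```python
def read_left_to_right(scheme, w):
    """Decode the non-empty word w by reading left to right, or return None."""
    max_len = max(len(code_word) for _, code_word in scheme)
    decoded = []
    while w:
        recognized = None
        # Lengthen the initial segment one letter at a time and stop at the
        # first segment that is a code word.
        for length in range(1, min(max_len, len(w)) + 1):
            for letter, code_word in scheme:  # smallest eligible index wins
                if code_word == w[:length]:
                    recognized = (letter, length)
                    break
            if recognized:
                break
        if recognized is None:
            return None
        letter, length = recognized
        decoded.append(letter)  # jot down the source letter
        w = w[length:]          # peel the recognized segment from w
    return "".join(decoded)


def read_right_to_left(scheme, w):
    """Mirror image: reverse every word, read left to right, reverse back."""
    mirrored = [(letter, code_word[::-1]) for letter, code_word in scheme]
    result = read_left_to_right(mirrored, w[::-1])
    return None if result is None else result[::-1]


scheme = [("a", "0"), ("b", "01"), ("c", "001")]
print(read_left_to_right(scheme, "001"))
print(read_right_to_left(scheme, "001"))
```

With this scheme, reading left to right on 001 peels off 0 twice and is then stuck on the lone 1, so it reports failure, while reading right to left recognizes 01 and then 0 and returns ab.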


4.1.7 If the words $w_i$ appearing in an encoding scheme are all of the same
length, the code is said to be a fixed-length or block code, and the common
length $\ell$ of the $w_i$ is said to be the length of the code. Otherwise, the code is
said to be a variable-length code.


4.1.8 If A = {0, 1}, or some other two-element set, the code is said to be binary. 


Exercises 4.1 


1. Let S be the set of all English words, let A be the set of letters a,b,...,z, 
and let the encoding scheme be defined by a very complete unabridged dic- 
tionary—the O.E.D. will do. Ignore capitalizations. Show that the code 
defined by this scheme is ambiguous. 


2. Suppose that $S = A$ and $\phi : S^+ \to S^+$ is defined by

\[ \phi(w) = \begin{cases} w, & \text{if } \mathrm{lgth}(w) \text{ is odd}, \\ ww, & \text{if } \mathrm{lgth}(w) \text{ is even}. \end{cases} \]

Show that $\phi$ is not given by an encoding scheme. Describe a VDR for this
code. Is this code uniquely decodable? Justify your answer.


3. Same questions as in 2, except that

\[ \phi(w) = \begin{cases} w, & \text{if } \mathrm{lgth}(w) \text{ is even}, \\ ww, & \text{if } \mathrm{lgth}(w) \text{ is odd}. \end{cases} \]


4. Suppose that $S = \{a,b,c\}$, $A = \{0, 1\}$, and consider the scheme
$a \to 0$, $b \to 010$, $c \to 0110$.


Show that neither reading-left-to-right nor reading-right-to-left provides a 
VDR for this code. Describe a VDR for this code—make it a better one 
than the clunker described in the proof of Theorem 4.1.5. Is this code 
uniquely decodable? Justify your answer. 


5. Give an encoding scheme for a uniquely decodable code for which reading- 
left-to-right is a VDR, but reading-right-to-left is not. 


6. Suppose that $S = \{a,b,c\}$, $A = \{0, 1\}$, and the encoding scheme is
$a \to 010$, $b \to 0100$, $c \to 0010$.


Is the code defined by this scheme uniquely decodable? Justify your answer.


7. Give an encoding scheme for the code described in 4.1.2. 


4.2 Prefix-condition codes and the Kraft-McMillan inequality


An encoding scheme $s_i \to w_i$, $i = 1,\dots,m$, satisfies the prefix condition if there
do not exist $i, j \in \{1,\dots,m\}$, $i \ne j$, such that $w_i$ is an initial segment, or prefix
(reading left to right), of $w_j$. The code determined by such a scheme is said to
be a prefix-condition code. The suffix condition is similarly defined.


4.2.1 Theorem Each prefix-condition code is uniquely decodable, with read- 
ing left-to-right providing a VDR. 


Proof: Left to you. □

Remark: a converse of this theorem holds; see Exercise 4.2.2.


4.2.2 Proposition If $s_i \to w_i$, $i = 1,\dots,m$, is a fixed-length encoding scheme,
then the following are equivalent:


(a) the scheme satisfies the prefix condition;
(b) the code defined by the scheme is uniquely decodable;
(c) $w_1,\dots,w_m$ are distinct.


Proof: Left to you. □


4.2.3 Corollary Given $n = |A|$, $m = |S|$, and a positive integer $\ell$, there is a
uniquely decodable fixed-length scheme for $S \to A$, of length $\ell$, if and only if

\[ m \le n^{\ell}. \]

Proof: $n^{\ell} = |A^{\ell}|$, so $m \le n^{\ell}$ means that there are $m$ distinct code words of
length $\ell$ available for the desired scheme. □


4.2.4 The code in Exercise 4.1.4 is neither prefix-condition nor suffix-condition, 
but is, nonetheless, uniquely decodable. 


4.2.5 If the beginning of the code word is on the left—i.e., if the code word is to 
be fed into the decoder from left to right—then it is clearly a great convenience 
for reading-left-to-right to be a VDR for the code; you can decode as the code 
word is being read. By contrast, in order to decode using reading-right-to-left, 
if the code word starts on the left, you have to wait until the entire code word or 
message has arrived before you can start decoding. Because of this advantage, 
prefix-condition codes are also called instantaneous codes (see [81]). Note
Exercise 4.2.2.

4.2.6 Theorem (Kraft’s Inequality) Suppose that $S = \{s_1,\dots,s_m\}$ is a source
alphabet, $A = \{a_1,\dots,a_n\}$ is a code alphabet, and $\ell_1,\dots,\ell_m$ are positive
integers. Then there is an encoding scheme $s_i \to w_i$, $i = 1,\dots,m$, for $S$ in terms of
$A$, satisfying the prefix condition, with $\mathrm{lgth}(w_i) = \ell_i$, $i = 1,\dots,m$, if and only
if $\sum_{i=1}^{m} n^{-\ell_i} \le 1$.


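Proof: For $w \in A^+$ and $\ell \ge \mathrm{lgth}(w)$, let

\[ \begin{aligned}
A(w, \ell) &= \{v \in A^{\ell};\ w \text{ is a prefix of } v\} && (4.1) \\
&= \{wu;\ u \in A^{\ell - \mathrm{lgth}(w)}\}. && (4.2)
\end{aligned} \]

Then $|A(w, \ell)| = |A^{\ell - \mathrm{lgth}(w)}| = n^{\ell - \mathrm{lgth}(w)}$. Observe that, if neither of
$w_1, w_2 \in A^+$ is a prefix of the other, and $\ell \ge \mathrm{lgth}(w_i)$, $i = 1,2$, then $A(w_1, \ell)$ and
$A(w_2, \ell)$ are disjoint.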

Assume, without loss of generality, that $1 \le \ell_1 \le \ell_2 \le \cdots \le \ell_m$. First
suppose that $\sum_{i=1}^{m} n^{-\ell_i} \le 1$. We will choose $w_1,\dots,w_m \in A^+$ such that no $w_i$
is a prefix of any $w_j$, $1 \le i < j \le m$, and the choosing will be straightforward.
Let $w_1$ be any member of $A^{\ell_1}$. Supposing we have obtained $w_1,\dots,w_k$ with
$w_i \in A^{\ell_i}$, $i = 1,\dots,k$, and no $w_i$ a prefix of any $w_j$, $1 \le i < j \le k \le m-1$, we
wonder if there is any $w_{k+1} \in A^{\ell_{k+1}}$ such that none of $w_1,\dots,w_k$ is a prefix of
$w_{k+1}$. Clearly there is such a $w_{k+1}$ if and only if

\[ A^{\ell_{k+1}} \setminus \bigcup_{i=1}^{k} A(w_i, \ell_{k+1}) \]

is non-empty. By the remarks above (and since the $A(w_i, \ell_{k+1})$, $i = 1,\dots,k$,
are pairwise disjoint),

\[ \begin{aligned}
\Big| \bigcup_{i=1}^{k} A(w_i, \ell_{k+1}) \Big|
 &= \sum_{i=1}^{k} |A(w_i, \ell_{k+1})| = \sum_{i=1}^{k} n^{\ell_{k+1} - \ell_i} \\
 &= n^{\ell_{k+1}} \sum_{i=1}^{k} n^{-\ell_i} < n^{\ell_{k+1}} \sum_{i=1}^{m} n^{-\ell_i} \le n^{\ell_{k+1}}.
\end{aligned} \]

Thus $A^{\ell_{k+1}} \setminus \bigcup_{i=1}^{k} A(w_i, \ell_{k+1})$ is non-empty. Thus we can find $w_1,\dots,w_m$ as
desired, by simple hunting and finding.

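On the other hand, if $1 \le \ell_1 \le \cdots \le \ell_m$ and there is a prefix-condition
scheme $s_i \to w_i$, with $\mathrm{lgth}(w_i) = \ell_i$, $i = 1,\dots,m$, then

\[ w_m \in A^{\ell_m} \setminus \bigcup_{j=1}^{m-1} A(w_j, \ell_m), \]

so

\[ 1 \le |A^{\ell_m}| - \sum_{j=1}^{m-1} |A(w_j, \ell_m)| = n^{\ell_m} - \sum_{j=1}^{m-1} n^{\ell_m - \ell_j}, \]

whence $\sum_{j=1}^{m} n^{-\ell_j} \le 1$. □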
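
The “simple hunting and finding” in the first half of the proof can be sketched in Python. Everything named here is an assumption for illustration: the code alphabet is taken to be the digits 0, ..., n-1 (so n is at most 10), the hunt goes through candidate words in lexicographic order, and the function names `kraft_sum` and `build_prefix_scheme` are hypothetical. The search is exhaustive and is meant to mirror the proof, not to be efficient.

```python
from fractions import Fraction
from itertools import product


def kraft_sum(n, lengths):
    """Exact value of the sum of n**(-l) over the prescribed lengths."""
    return sum(Fraction(1, n ** l) for l in lengths)


def build_prefix_scheme(n, lengths):
    """Return code words with the given lengths satisfying the prefix
    condition, or None if Kraft's inequality fails."""
    if kraft_sum(n, lengths) > 1:
        return None
    alphabet = "0123456789"[:n]  # assumed code alphabet, requires n <= 10
    # Handle the lengths in non-decreasing order, as in the proof.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    chosen = []
    word_for = {}
    for i in order:
        for letters in product(alphabet, repeat=lengths[i]):
            candidate = "".join(letters)
            # Accept the first word having no earlier choice as a prefix.
            if not any(candidate.startswith(w) for w in chosen):
                chosen.append(candidate)
                word_for[i] = candidate
                break
    return [word_for[i] for i in range(len(lengths))]


print(build_prefix_scheme(2, [2, 2, 2, 3, 3]))
print(build_prefix_scheme(2, [1, 1, 2]))
```

The first call returns five binary words of the requested lengths; the second returns None, since 1/2 + 1/2 + 1/4 exceeds 1.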


4.2.7 The usefulness of being able to prescribe the lengths $\ell_1,\dots,\ell_m$ of the
words $w_1,\dots,w_m$ in a prefix-condition encoding scheme will become clear in
the next two sections. See also Exercise 4.2.5.


Once $\ell_1,\dots,\ell_m$ satisfying $\sum_{i=1}^{m} n^{-\ell_i} \le 1$ have been prescribed, there is no
obstacle, according to the proof preceding, to choosing the $w_i$ for the scheme,
provided $\ell_1 \le \ell_2 \le \cdots \le \ell_m$. It may be necessary to reorder $s_1,\dots,s_m$ to achieve
this ordering of the $\ell_i$.

In some situations it may be wise to prescribe $\ell_1,\dots,\ell_m$ satisfying the strict
inequality $\sum_{i=1}^{m} n^{-\ell_i} < 1$, i.e., to avoid $\ell_1,\dots,\ell_m$ satisfying $\sum_{i=1}^{m} n^{-\ell_i} = 1$,
even though the $\ell_i$ in such a sequence may be more desirable in the short run.
The practical reason is that the customer buying the encoding scheme may wish
to enlarge the source alphabet at some future time.


4.2.8 When $\ell_1 = \ell_2 = \cdots = \ell_m = \ell$, i.e., when the scheme is to be fixed-length,
then the condition given in Kraft’s Inequality for the existence of a
prefix-condition code simplifies to $n^{\ell} \ge m$. Since $n^{\ell} = |A^{\ell}|$ and $m = |S|$, this condition
is also seen to be necessary and sufficient by Proposition 4.2.2.


4.2.9 Theorem (McMillan’s Inequality) If $|S| = m$, $|A| = n$, and $s_i \to w_i \in
A^{\ell_i}$, $i = 1,\dots,m$, is an encoding scheme resulting in a uniquely decodable code,
then $\sum_{i=1}^{m} n^{-\ell_i} \le 1$.


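Proof: Without loss of generality, assume that $\ell_m$ is the largest of the $\ell_i$. For
any positive integer $k$,

\[ \Big( \sum_{i=1}^{m} n^{-\ell_i} \Big)^{k}
 = \sum_{i_1=1}^{m} \sum_{i_2=1}^{m} \cdots \sum_{i_k=1}^{m} n^{-(\ell_{i_1} + \cdots + \ell_{i_k})}
 = \sum_{r=1}^{k\ell_m} \frac{h(r)}{n^{r}}, \]

where $h(r)$ is the number of times $r$ occurs as a sum $\ell_{i_1} + \cdots + \ell_{i_k}$, as $i_1,\dots,i_k$
roam independently over $\{1,\dots,m\}$.

If $r = \ell_{i_1} + \cdots + \ell_{i_k}$, then $r = \mathrm{lgth}(w_{i_1} \cdots w_{i_k})$. By the assumption of unique
decodability, if $(i_1,\dots,i_k) \ne (i_1',\dots,i_k')$, then $w_{i_1} \cdots w_{i_k} \ne w_{i_1'} \cdots w_{i_k'}$. This
means that the function $(i_1,\dots,i_k) \mapsto w_{i_1} \cdots w_{i_k}$ from $\{(i_1,\dots,i_k);\ 1 \le i_j \le m,\
j = 1,\dots,k \text{ and } \sum_{j=1}^{k} \ell_{i_j} = r\}$ into $A^{r}$ is an injection; since the size of the
domain is $h(r)$ and the size of $A^{r}$ is $n^{r}$, it follows that $h(r) \le n^{r}$, and we have

\[ \Big( \sum_{i=1}^{m} n^{-\ell_i} \Big)^{k} = \sum_{r=1}^{k\ell_m} \frac{h(r)}{n^{r}} \le k\ell_m, \]

so $\sum_{i=1}^{m} n^{-\ell_i} \le k^{1/k} \ell_m^{1/k} \to 1$ as $k \to \infty$. □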


The elegant proof of McMillan’s Inequality given here is due to Karush 
[38]. 
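
The counting in this proof can be watched at work on small cases. The Python sketch below is an illustration under assumptions: the function name `concatenation_counts` is hypothetical, and the two sets of code words are the schemes from Exercise 4.1.4 and from the example in 4.1.6. It tabulates $h(r)$ for all $k$-fold concatenations and reports whether distinct $k$-tuples always give distinct words.

```python
from collections import Counter
from itertools import product


def concatenation_counts(words, k):
    """Return (h, injective) for the k-fold concatenations of the code words.

    h[r] is the number of k-tuples whose concatenation has length r, and
    injective is True if no two k-tuples produce the same word.
    """
    h = Counter()
    seen = set()
    injective = True
    for combo in product(words, repeat=k):
        word = "".join(combo)
        h[len(word)] += 1
        if word in seen:
            injective = False
        seen.add(word)
    return h, injective


# Scheme of Exercise 4.1.4, stated in 4.2.4 to be uniquely decodable.
h, injective = concatenation_counts(["0", "010", "0110"], 3)
print(injective, all(h[r] <= 2 ** r for r in h))

# Ambiguous scheme of 4.1.6: a, b, c and c, a, b both encode to 001001.
_, injective_ambiguous = concatenation_counts(["0", "01", "001"], 3)
print(injective_ambiguous)
```

For the first scheme the map is injective and $h(r) \le 2^r$ holds for every $r$ that occurs, as the proof requires. For the second scheme injectivity already fails at $k = 3$, even though its lengths happen to satisfy the inequality.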


4.2.10 The encoding scheme in the hypothesis of Theorem 4.2.9 is not as- 
sumed to be a prefix-condition scheme. Thus McMillan’s Inequality improves 
the “only if” assertion of Kraft’s Inequality. 


4.2.11 Corollary (of 4.2.6 and 4.2.9) Suppose $|S| = m$, $|A| = n$, and $\ell_1,\dots,
\ell_m$ are positive integers. The following are equivalent:
(a) there is an encoding scheme $s_i \to w_i \in A^{\ell_i}$, $i = 1,\dots,m$, resulting in a
uniquely decodable code;
(b) there is a prefix-condition encoding scheme $s_i \to w_i \in A^{\ell_i}$, $i = 1,\dots,m$;
(c) $\sum_{i=1}^{m} n^{-\ell_i} \le 1$.


The moral is that if unique decodability and prescribing the lengths of the 
$w_i$ in the scheme are the only considerations, then there is no reason to consider
anything except prefix-condition (or, in some countries, suffix-condition) codes. 


Exercises 4.2 


1. Suppose $|S| = m$, $|A| = n$, and we are considering only fixed-length encoding
schemes of length $\ell$. The resulting code is to be uniquely decodable.
Find (a) the smallest value of $\ell$ possible if $m = 26$ and $n = 2$; (b) the smallest
value of $\ell$ possible if $m = 26$ and $n = 3$; (c) the smallest value of $n$
possible if $m = 80$ and $\ell \le 4$; (d) the largest value of $m$ possible if $n = 2$
and $\ell \le 6$.


2. Prove that if a code, given by an encoding scheme, is uniquely decodable,
with reading-left-to-right a VDR for the code, then the scheme satisfies the 
prefix condition. (Hint: prove the contrapositive. That is, start by suppos- 
ing that the scheme does not satisfy the prefix condition, and prove that 
either reading-left-to-right is not a VDR for the code, or the code is not 
uniquely decodable.) Give an example of a scheme that does not satisfy the 
prefix condition for which reading-left-to-right does provide a VDR. 


3. Suppose that $|S| = m \ge 2$ and $|A| = n \ge 2$. For reasons that may become
clear later, we will say that a non-decreasing sequence $\ell_1 \le \cdots \le \ell_m$ of
positive integers is an n-ary Huffman sequence if there is a prefix-condition
encoding scheme $s_j \to w_j \in A^{\ell_j}$, $j = 1,\dots,m$, but if any of the $\ell_j$ is
reduced by one and the new sequence is denoted $\ell_1',\dots,\ell_m'$, then there is
no prefix-condition scheme $s_j \to w_j \in A^{\ell_j'}$, $j = 1,\dots,m$. [Convention:
$A^0 = \emptyset$.] Find the n-ary Huffman sequences $\ell_1 \le \cdots \le \ell_m$ when


(a) $n = 2$ and $m = 5$
(b) $n = 3$ and $m = 5$
*(c) $n = 2$ and $m = 26$.


4. Suppose S = {a,b,c,d,e} and A = {0, 1}. Find a prefix-condition encoding 
scheme for S in terms of A, corresponding to each of the sequences you
found in 3(a), above. 


5. Let $|A| = n$, $|S| = m$, and $L = \max_{1 \le j \le m} \mathrm{lgth}(w_j)$. Let us say that a scheme
$s_j \to w_j \in A^+$, $j = 1,\dots,m$, is good if it results in a uniquely decodable
code.


(a) For fixed $n$ and $L$, what is the largest value of $m$ possible if there is to
be a good scheme for $S$?

(b) For fixed $m$ and $L$, what is the smallest value of $n$ possible if there is
to be a good scheme for $S$?

(c) For fixed $m$ and $n$, what is the smallest value of $L$ possible if there is
to be a good scheme for $S$?


[Hint: in every case, the optimum is achieved with a fixed-length encoding
scheme.]


4.3 Average code word length and Huffman’s algorithm 


Suppose that $s_i \to w_i \in A^+$, $i = 1,\dots,m$, is an encoding scheme for a source
alphabet $S = \{s_1,\dots,s_m\}$. Suppose it is known that the source letters $s_1,\dots,s_m$
occur with relative frequencies $f_1,\dots,f_m$, respectively. That is, $f_i$ is to be
regarded as the probability that a letter selected at random from the source text
will be $s_i$. It follows that $\sum_{i=1}^{m} f_i = 1$. We will refer to the $f_i$ as the relative
source letter frequencies, or source frequencies, for short.


Definition In the circumstances described above, the average code word length
of the code defined by the encoding scheme is

\[ \bar{\ell} = \sum_{i=1}^{m} f_i \, \mathrm{lgth}(w_i). \]


4.3.1 Note that “average code word length” is a bit of a misnomer. The correct
term would be “average length of a code word replacing a source letter.”


4.3.2 $\bar{\ell}$ is, in fact, the average value of the random variable “length of the code
word replacing the source letter” associated with the experiment of randomly
selecting a source letter from the source text. On the (dubious?) grounds that
reading a section of source text amounts to carrying out the selection of source
letters a number of times, it follows from Theorem 1.8.6 that the average, or
expected, number of code letters required to encode a source text consisting of
$N$ source letters is $\bar{\ell} N$.
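
As a small worked instance of the definition, the following Python sketch computes $\bar{\ell}$ and the expected number of code letters $\bar{\ell} N$. The scheme is the one from the example in Section 4.1; the source frequencies 1/2, 1/4, 1/4, the value N = 1000, and the function name are assumptions invented for the illustration.

```python
from fractions import Fraction


def average_code_word_length(freqs, scheme):
    """Sum of f_i * lgth(w_i) over the productions of the scheme."""
    return sum(freqs[s] * len(scheme[s]) for s in scheme)


scheme = {"a": "01", "b": "10", "c": "111"}
# Hypothetical source frequencies, chosen only so that they sum to 1.
freqs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

ell_bar = average_code_word_length(freqs, scheme)
expected_code_letters = ell_bar * 1000  # expected length for N = 1000
print(ell_bar, expected_code_letters)
```

Here $\bar{\ell} = (1/2)2 + (1/4)2 + (1/4)3 = 9/4$, so a source text of 1000 letters is expected to need 2250 code letters.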


Recall that the code letters are also the input letters of a channel. It may
be expensive and time consuming to transmit long sequences of code letters;
therefore, it may be desirable for $\bar{\ell}$ to be as small as possible. It is within our
power to make $\bar{\ell}$ small by cleverly making arrangements when we devise the
encoding scheme. What constraints must we observe?

For one thing, we want the resulting code to be uniquely decodable; since $\bar{\ell}$
is a function of the $\ell_i = \mathrm{lgth}(w_i)$, it follows from Corollary 4.2.11 that we may
as well confine ourselves to prefix-condition codes.

This is the only constraint we will observe in this section; it is, happily, a 
simplifying constraint—it makes life easier to be confined to a smaller array of 
choices. In later sections, however, we will encounter other purposes that might 
be served in the construction of the encoding scheme. These other purposes are: 
good approximation of the optimal input frequencies of the channel and error 
correction. In no case do these matters require us to abandon prefix-condition 
codes, but they sometimes do conflict with the minimization of $\bar{\ell}$. When there
are more concerns to juggle than just the shortening of the input text, when 
compromises must be made, the methods to be described in this section may 
have to be modified or abandoned. 

Common sense or intuition suggests that, in order to minimize $\bar{\ell}$, we ought
to have the frequently occurring source letters represented by short code words, 
and to reserve the longer code words of the scheme for the rarely occurring 
source letters. It is left to the reader to decide whether or not a proof of the 
validity of this strategy is required. Proofs are available, and, in fact, the validity 
is enshrined in a famous theorem. 


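4.3.3 Theorem [Ch. 10, 28] Suppose that $f_1 \ge f_2 \ge \cdots \ge f_m$ and $\ell_1 \le \ell_2 \le
\cdots \le \ell_m$. Then for any rearrangement $\ell_1',\dots,\ell_m'$ of the list $\ell_1,\dots,\ell_m$,

\[ \sum_{i=1}^{m} f_i \ell_i \le \sum_{i=1}^{m} f_i \ell_i'. \]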
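
For a small list, the theorem can be confirmed by brute force over all rearrangements. The Python sketch below is only a numerical check on one assumed instance (the frequencies 0.3, 0.25, 0.2, 0.15, 0.1 and the lengths 1, 2, 3, 4, 4 are chosen for illustration), not a proof.

```python
from fractions import Fraction
from itertools import permutations


def weighted_length(freqs, lengths):
    """Sum of f_i * l_i for frequencies and lengths paired by position."""
    return sum(f * l for f, l in zip(freqs, lengths))


# Frequencies in non-increasing order, lengths in non-decreasing order.
freqs = [Fraction(3, 10), Fraction(1, 4), Fraction(1, 5),
         Fraction(3, 20), Fraction(1, 10)]
lengths = [1, 2, 3, 4, 4]

sorted_value = weighted_length(freqs, lengths)
best_value = min(weighted_length(freqs, perm) for perm in permutations(lengths))
print(sorted_value, best_value)
```

Both printed values agree: no rearrangement of the lengths does better than pairing the largest frequencies with the smallest lengths.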

Recall that, to obtain a prefix-condition encoding scheme $s_j \to w_j$, $j =
1,\dots,m$, with $\mathrm{lgth}(w_j) = \ell_j$, where $\sum_{j=1}^{m} n^{-\ell_j} \le 1$, we have no worries
provided $\ell_1 \le \ell_2 \le \cdots \le \ell_m$. With the $\ell_j$ in non-decreasing order, we happily choose
our $w_j$ avoiding prefixes, without a snag.

It follows that, given the source frequencies, it would be a shrewd first
move to reorder the source alphabet, and, correspondingly, the $f_i$, so that $f_1 \ge
f_2 \ge \cdots \ge f_m$. We shall henceforward consider the $f_i$ to be so ordered. To
minimize $\bar{\ell}$, we look among sequences $\ell_1,\dots,\ell_m$ of positive integers satisfying
$\ell_1 \le \cdots \le \ell_m$ and $\sum_{i=1}^{m} n^{-\ell_i} \le 1$.

One way to proceed would be to look among the minimal such sequences,
the n-ary Huffman sequences defined in Exercise 4.2.3, and to select from those
the one that makes $\bar{\ell}$ the smallest.


4.3.4 Example Suppose $S = \{a, b, c, d, e\}$, $A = \{0, 1\}$, and the source frequencies
are $f_a = 0.25$, $f_b = 0.15$, $f_c = 0.1$, $f_d = 0.2$, and $f_e = 0.3$. We reorder the
source alphabet: $e, a, d, b, c$. We look among minimal sequences (also called
$n$-ary Huffman sequences) $\ell_e \le \ell_a \le \ell_d \le \ell_b \le \ell_c$. It is hoped that you found
three such sequences, in doing problem 3(a) at the end of the preceding section:


(1) 1, 2, 3, 4, 4;
(2) 1, 3, 3, 3, 3; and
(3) 2, 2, 2, 3, 3.


The average code word lengths corresponding to these different sequences are 


$\bar{\ell}_1 = (0.3)1 + (0.25)2 + (0.2)3 + (0.15)4 + (0.1)4 = 2.4$,
$\bar{\ell}_2 = (0.3)1 + (0.25 + 0.2 + 0.15 + 0.1)3 = 2.4$, and
$\bar{\ell}_3 = (0.3 + 0.25 + 0.2)2 + (0.15 + 0.1)3 = 2.25$.


Thus list (3) is the winner. An optimal encoding scheme: 


e → 00
a → 11
d → 10
b → 010
c → 011.
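
The search carried out in Example 4.3.4 can be sketched in Python. This is an illustrative sketch under stated assumptions, not a general-purpose routine: the function names are ours, a sequence is taken to be minimal when Kraft's inequality holds but fails once the largest length is reduced by one, and the bound `max_len=5` on the lengths searched is an assumption that is ample for five source letters.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def kraft_sum(lengths, n):
    """Return sum_j n**(-l_j) as an exact fraction."""
    return sum(Fraction(1, n ** l) for l in lengths)


def is_huffman_sequence(lengths, n):
    """Kraft's inequality holds, but not after the largest length drops by one."""
    *rest, last = sorted(lengths)
    if kraft_sum(lengths, n) > 1:
        return False
    return kraft_sum(rest, n) + Fraction(1, n ** (last - 1)) > 1


def huffman_sequences(m, n, max_len):
    """All non-decreasing minimal sequences of m lengths, each at most max_len."""
    return [seq
            for seq in combinations_with_replacement(range(1, max_len + 1), m)
            if is_huffman_sequence(seq, n)]


# Frequencies of e, a, d, b, c, in non-increasing order.
freqs = [Fraction(3, 10), Fraction(1, 4), Fraction(1, 5),
         Fraction(3, 20), Fraction(1, 10)]
candidates = huffman_sequences(m=5, n=2, max_len=5)
averages = {seq: sum(f * l for f, l in zip(freqs, seq)) for seq in candidates}
best = min(averages, key=averages.get)
for seq, avg in averages.items():
    print(seq, float(avg))
print("best:", best)
```

For larger alphabets the number of candidate sequences grows quickly, so a search of this kind is practical only when $m$ is small.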


The process of finding all possible minimal sequences $\ell_1 \le \cdots \le \ell_m$ satisfying
$\sum_{j=1}^{m} n^{-\ell_j} \le 1$ can be algorithmized. This approach to minimizing $\bar{\ell}$ is
worth keeping in mind, especially since it is adaptable to “mixed” optimization
problems, in which we want to keep $\bar{\ell}$ small and serve some other purpose: for
instance, we might like the input frequency of the code letters to be close to
the optimal input frequencies for the channel (see Section 4.4). In such problems
we may agree to an encoding scheme that effects a compromise between
(or among) the contending requirements; perhaps $\bar{\ell}$ won’t be as small as we
could get, but it will still be quite small and our other purposes will be served
reasonably well.
In making shopper’s choices in such mixed problems, it is not at all inefficient
or unreasonable to have all the alternative schemes arrayed before us,
among which to choose. If the numbers involved are not astronomical, and the 
time consumed is not prohibitive, especially since we are shopping for a “big 
ticket” item, it is reasonable to take the trouble to find an encoding scheme 
which, once chosen, will be installed and used for the foreseeable future. 

But mathematicians dislike “brute force” in making choices; the brute- 
force, shopping-in-the-warehouse approach suggested above may, in fact, be 
forced upon us in real life for some purposes, but what follows is faster and 
more elegant in the cases where minimizing $\bar{\ell}$ (with a prefix-condition scheme)
is our only objective. 


4.3.5 Huffman’s algorithm We suppose that $f_1 \ge f_2 \ge \cdots \ge f_m$, and that
$m = n + k(n-1)$ for some non-negative integer $k$. This last requirement can
be achieved by adding letters to the source alphabet and assigning source
frequency zero to the added letters. Note that when $n = 2 \le m$, this bothersome
preliminary is unnecessary.

Merge. If $m = n$, go to encode, below. Otherwise, form a new source
alphabet with $n + (k-1)(n-1)$ letters by merging the $n$ source letters with
least source frequencies into a single source letter, whose frequency will be
the sum of the source frequencies of the merged letters. Thus, the new source
alphabet is $S' = \{s_1, \ldots, s_{m-n}, \sigma\}$ and the $s_j$, $1 \le j \le m-n$, have frequencies
$f_j$, while $\sigma$ has frequency $\sum_{j=m-n+1}^{m} f_j$.

Note which letters were merged into $\sigma$, and reorder $S'$ so that source
frequencies are in non-increasing order. With $S'$ replacing $S$ and with the new
source frequencies replacing $f_1, \ldots, f_m$, go to merge.

Encode. We are here initially with a source alphabet $\tilde{S}$ with $n$ letters (the
last alphabet produced by merging, or $S$ itself if $m = n$). We
form a scheme by which these letters are put into one-to-one correspondence
with the letters of $A$, the code alphabet. We will derive from this scheme an
optimal encoding scheme for the original source alphabet $S$, by working our
way back through the sequence of source alphabets obtained by merging. At
each stage of the journey from $\tilde{S}$ back to $S$, we obtain, from the current encoding
scheme, an encoding scheme for the next alphabet (along the road back to $S$) by
an obvious and straightforward procedure. Suppose that $S''$ was obtained from
$S'$ by merging, and suppose that we have an encoding scheme for $S''$. Suppose
that $\sigma \in S''$ was obtained by merging $s'_{t+1}, \ldots, s'_{t+n} \in S'$. Suppose that, in the
encoding scheme for $S''$, the production involving $\sigma$ is $\sigma \to w$. Then the scheme
for $S'$ is obtained from that for $S''$ by replacing the single production $\sigma \to w$ by
the $n$ productions

$$s'_{t+1} \to w a_1, \quad \ldots, \quad s'_{t+n} \to w a_n,$$

where $a_1, \ldots, a_n$ are the letters of $A$.
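
The merge and encode steps can be combined in a short program. The following Python sketch is one possible rendering of the algorithm, not a definitive implementation: the names `huffman` and `average_length` are ours, a heap stands in for the repeated reordering of the alphabet, and ties among equal frequencies are broken arbitrarily by insertion order. Instead of unwinding the merges afterward, each merge prepends a code letter to the words of all source letters in the merged group, which yields the same productions.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman(freqs, alphabet):
    """freqs: dict source letter -> frequency; alphabet: code letters (n >= 2)."""
    n = len(alphabet)
    tie = count()  # unique tie-breaker, so the heap never compares the lists
    # Heap entries: (frequency, tie-breaker, source letters merged so far).
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    # Add zero-frequency dummy letters until m = n + k(n - 1).
    while len(heap) < n or (len(heap) - n) % (n - 1) != 0:
        heap.append((0, next(tie), []))
    heapq.heapify(heap)
    code = {s: "" for s in freqs}
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(n)]  # n least frequent
        merged = []
        for a, (_, _, letters) in zip(alphabet, group):
            for s in letters:
                code[s] = a + code[s]  # later merges supply earlier code letters
            merged += letters
        heapq.heappush(heap, (sum(g[0] for g in group), next(tie), merged))
    return code


def average_length(freqs, code):
    return sum(freqs[s] * len(code[s]) for s in freqs)


# The source of Example 4.3.4, encoded with a binary and a ternary alphabet.
freqs = {"e": Fraction(3, 10), "a": Fraction(1, 4), "d": Fraction(1, 5),
         "b": Fraction(3, 20), "c": Fraction(1, 10)}
binary = huffman(freqs, "01")
ternary = huffman(freqs, "01*")
print(binary, float(average_length(freqs, binary)))
print(ternary, float(average_length(freqs, ternary)))
```

Because of the arbitrary tie-breaking, the code words produced may differ from those obtained by hand, but the average code word length does not.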


4.3.6 Examples Consider the situation in 4.3.4, in which $S = \{a, b, c, d, e\}$,
$A = \{0, 1\}$ and $f_e = 0.3 > f_a = 0.25 > f_d = 0.2 > f_b = 0.15 > f_c = 0.1$. We
run Huffman’s algorithm. First merge ($b$ and $c$ merged into $\sigma_1$):


$S_1 = \{e, a, \sigma_1, d\}$
Frequencies: 0.3, 0.25, 0.25, 0.2


Second merge ($\sigma_1$ and $d$ merged into $\sigma_2$):


$S_2 = \{\sigma_2, e, a\}$
Frequencies: 0.45, 0.3, 0.25


Last merge ($e$ and $a$ merged into $\sigma_3$):


$S_3 = \{\sigma_3, \sigma_2\}$.

For the encoding, we obtain

$S_3$: $\sigma_3 \to 0$, $\sigma_2 \to 1$;
$S_2$: $\sigma_2 \to 1$, $e \to 00$, $a \to 01$;
$S_1$: $e \to 00$, $a \to 01$, $\sigma_1 \to 10$, $d \to 11$; and
$S$: $e \to 00$, $a \to 01$, $d \to 11$, $b \to 100$, $c \to 101$.


Note that the algorithm does, indeed, give a code with minimal $\bar{\ell}$, by the
work done in 4.3.4. Note that the encoding scheme is different from that given 
in 4.3.4, and that, in fact, there is no way to apply Huffman’s algorithm to this 
example to obtain the scheme of 4.3.4. This is because Huffman’s algorithm 
will result in the code words representing e and a having the same first digit. 

Let us apply Huffman’s algorithm with the same $S$ and source frequencies,
but with $A = \{0, 1, *\}$, i.e., with $n = 3$. Note that $5 = 3 + 1 \cdot 2$, so we need not
add any source letters with zero frequency (or, equivalently, we need not merge
fewer than $n$ letters on the first merge).

The first and only merge: $S_1 = \{e, a, \sigma\}$ [$d$, $b$, and $c$ are merged]. The
schemes are given by $S_1$: $e \to 0$, $a \to 1$, $\sigma \to *$ and $S$: $e \to 0$, $a \to 1$, $d \to *0$,
$b \to *1$, $c \to **$.


The proof of the fact that Huffman’s algorithm always results in an opti- 
mal prefix-condition encoding scheme is outlined in Section 4.3.1 (filling in the 
details is left to the reader as Exercise 4.3.5). 

We conclude this section with the statement of a famous theorem of Shannon
which relates the $\bar{\ell}$ achievable by Huffman’s algorithm to the source entropy.
The proof of this theorem is postponed until Section 5.4 where a sharper
statement of the theorem is proven for only the binary case. However, the proof
there can be easily modified to give an equally sharp theorem for all $n \ge 2$.

4.3.7 Noiseless Coding Theorem for memoryless sources Suppose $|S| = m$,
$|A| = n \ge 2$, and the source frequencies are $f_1, \ldots, f_m$. Let
$H = -\sum_{i=1}^{m} f_i \log f_i$. For every encoding scheme for $S$, in terms of $A$, resulting
in a uniquely decodable code, the average code word length $\bar{\ell}$ satisfies

$$\bar{\ell} \ge H/\log n.$$

Furthermore, there exists a prefix-condition scheme for which

$$\bar{\ell} < H/\log n + 1.$$


The first inequality above, $\bar{\ell} \ge H/\log n$, has an interpretation that makes
the result seem self-evident, if you do not look too closely. Let the base of
the logarithm be $n$, the size of the code alphabet. Then the inequality becomes
$\bar{\ell} \ge H$. Also, setting this base for the logarithm defines the unit of information:
each code letter can carry at most one unit of information. (See Section 2.1.1.)

Now, $H$ is the average number of information units per source letter and $\bar{\ell}$
is the average number of code letters per source letter arising from the encoding
scheme. If we have unique decodability, no information is lost; so the average
amount of information carried by the code words representing the source letters,
which is $\bar{\ell}$ units at most, must be at least as great as $H$, the average number of
units of information per source letter; for if the volume of a vessel is less than
that of the fluid that is poured into it, there will be spillage.

The fact that there is a rigorous mathematical proof of this inequality is 
further evidence that Shannon’s definition of information is satisfactory on an 
intuitive level. 
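
Both inequalities of Theorem 4.3.7 can be observed numerically for the source of Example 4.3.4. The Python sketch below is illustrative only. The Huffman averages 2.25 (binary) and 1.45 (ternary) are computed from the schemes of 4.3.4 and 4.3.6. For the second inequality we use the standard choice of lengths, the smallest positive integers $\ell_j$ with $n^{-\ell_j} \le f_j$; this is a well-known construction and is not claimed to be the argument of Section 5.4. The function names are ours, and all frequencies are assumed positive and less than 1.

```python
import math
from fractions import Fraction


def entropy(freqs, base):
    """H with logarithms to the given base, i.e. H / log n when base = n."""
    return -sum(float(f) * math.log(float(f), base) for f in freqs if f > 0)


def shannon_lengths(freqs, n):
    """Smallest positive integer l_j with n**(-l_j) <= f_j, for each j."""
    lengths = []
    for f in freqs:
        l = 1
        while Fraction(1, n ** l) > f:
            l += 1
        lengths.append(l)
    return lengths


freqs = [Fraction(3, 10), Fraction(1, 4), Fraction(1, 5),
         Fraction(3, 20), Fraction(1, 10)]
for n, huffman_avg in ((2, 2.25), (3, 1.45)):
    H = entropy(freqs, n)
    ls = shannon_lengths(freqs, n)
    kraft = sum(Fraction(1, n ** l) for l in ls)   # at most 1, so a prefix scheme exists
    avg = float(sum(f * l for f, l in zip(freqs, ls)))
    print(f"n={n}: H/log n = {H:.4f}, Huffman average = {huffman_avg}, "
          f"lengths {ls} with Kraft sum {kraft} and average {avg}")
```

In both cases the Huffman average lies between $H/\log n$ and the average obtained from these lengths, which in turn is less than $H/\log n + 1$.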


Exercises 4.3 
1. Suppose S = {a, b, c, d, e, f, g} and the source frequencies are given in the 
following table: 


letter | a    b    c    d    e    f    g
freq   | .2   .12  .08  .15  .25  .1   .1


Use the Huffman encoding algorithm to obtain an optimal prefix-condition 
scheme for S when 


(a) A = {0, 1}
(b) A = {0, 1, *}.


2. Table 4.1 gives the relative frequencies, in English prose minus punctuation 
and blanks, ignoring capitalization, of the alphabetic characters a, b, ... , z,
estimated by examination of a large block of English prose, believed to be 
typical. This table is copied, with one small change, from [6, Appendix 1]. 
Find an optimal (with respect to average code word length) prefix-condition 
encoding scheme for S = {a, b, ... , z} if 


(a) A= {0,1}; 
Table 4.1: Single-letter frequencies in English text.

[Columns: Character | % Freq | Character | % Freq. The table body, listing the
characters a through z with their percentage frequencies, is missing here.]


(b) A = {0, 1, *}.
(c) What are the lengths of the shortest fixed-length encoding schemes, re- 
sulting in uniquely decodable codes, for S, in cases (a) and (b), above? 


*3. (For those with calculators and some free time.) Verify the conclusion of 
Theorem 4.3.7 in the circumstances of the preceding exercise. 


4. Suppose S = {a, b, c, d, e, f} and the source frequencies are given in: 


letter | a    b    c    d    e    f
freq   | .2   .15  .05  .2   .25  .15


Use Huffman’s algorithm to encode $S \to A$ when (a) $A = \{0, 1\}$ and (b)
$A = \{0, 1, *\}$. Did you notice that there were choices to be made in running
the algorithm in the “merge” part of the process? Run the algorithm in
all possible ways, in (a) and (b), if you haven’t already. In each case, you
should arrive at two essentially different schemes, essentially different in
that the sequences of code word lengths are different. However, in each
case, $\bar{\ell}$ is minimized by both schemes.


*5. Establish the validity of Huffman’s algorithm by filling the gaps in the proof 
given in Section 4.3.1, below. 


*6. Suppose that $\ell_1 \le \cdots \le \ell_m$ is an $n$-ary Huffman sequence. Show that
$\bar{\ell} = \sum_{j=1}^{m} f_j \ell_j$ is minimal among the numbers $h(x_1, \ldots, x_m) = \sum_{j=1}^{m} f_j x_j$,
where $x_1, \ldots, x_m$ are integers satisfying $\sum_{j=1}^{m} n^{-x_j} \le 1$, if the $f_j$ are
defined by $f_j = n^{-\ell_j} \left( \sum_{i=1}^{m} n^{-\ell_i} \right)^{-1}$, $j = 1, \ldots, m$. [You may as well assume
that $x_1 \le \cdots \le x_m$ is an $n$-ary Huffman sequence. There are only finitely
many of these, and thus only finitely many values of $G_0 = \sum_{j=1}^{m} n^{-x_j}$ to
consider. Fix one of these, and now suddenly allow the $x_j$ to vary freely,
even into non-integer values, but subject to the constraint $G_0 = \sum_{j=1}^{m} n^{-x_j}$.
Use the Lagrange multiplier method to attempt to find where $h(x_1, \ldots, x_m)$
achieves its minimum, subject to this constraint. You will find that when
$G_0 \ne G = \sum_{j=1}^{m} n^{-\ell_j}$, the minimum is not achieved at an $n$-ary Huffman
sequence $x_1, \ldots, x_m$, and when $G_0 = G$, the minimum is achieved when
$x_j = \ell_j$, $j = 1, \ldots, m$.]


4.3.1 The validity of Huffman’s algorithm 


In this section we will try to lead whoever is interested through a proof of
the validity of Huffman’s algorithm. In fact, we will prove more: not only
does Huffman’s algorithm always give a “right answer,” but, also, every “right
answer,” in case there is more than one, as in problem 4, above, can be obtained
by some instance of Huffman’s algorithm. By a “right answer” here we do
not mean any actual prefix-condition encoding scheme which minimizes $\bar{\ell}$, but
rather the sequence of lengths of the code words in such an encoding scheme.
(That Huffman’s algorithm always produces a prefix-condition scheme is quite
easy to see; we leave it to the reader to work through the proof.)

There is a concise proof of the validity of Huffman’s algorithm in the binary 
case, in Huffman’s original paper [36], and this proof can be easily extended 
to prove the stronger statement given here, when n = 2. However, there are 
some unexpected difficulties that crop up when n > 2 that appear to necessitate 
a much longer proof. We have not seen a proof for n > 2 elsewhere. Both 
Huffman [36] and Welsh [81] give proofs for n = 2 and dismiss the cases n > 2 
as similar. Jones [37] notes that the case n > 2 is significantly different from 
the case n = 2 but does not give a proof for n > 2. 

Thanks are due to Luc Teirlinck for several of the observations on which the 
proof given here is based. Even more thanks are due to Heather-Jean Matheson, 
who, while an undergraduate at the University of Prince Edward Island, discov- 
ered a serious error in the purported proof in the first edition of this text. (She not 
only noticed that the logic of a certain inference was wrong, she demonstrated 
that it could not be made right, by giving a beautiful example. Unfortunately, it 
would take us too far afield to explain that example here.) Yet further portions 
of gratitude are due to Maxim Burke for elegantly fixing the error, in a way that 
improves the entire proof. The statements and proofs of Propositions 4.3.8 and 
4.3.9 are entirely due to him. 

Recall, from Exercise 4.2.3, that an $n$-ary Huffman sequence is a sequence
$\ell_1 \le \cdots \le \ell_m$ of positive integers such that there is a prefix-condition encoding
scheme $s_j \to w_j \in A^{\ell_j}$, $j = 1, \ldots, m$, for encoding an $m$-letter source alphabet
$S$ with an $n$-letter code alphabet $A$, minimal in the sense that if any of the $\ell_j$ is
reduced by one and the new sequence is denoted $\ell'_1, \ldots, \ell'_m$, there is no
prefix-condition scheme $s_j \to w'_j \in A^{\ell'_j}$, $j = 1, \ldots, m$. [Convention: $A^0 = \emptyset$.]

Notice that, given relative source frequencies $f_1 \ge \cdots \ge f_m > 0$, any
sequence $\ell_1 \le \cdots \le \ell_m$ of code word lengths for a prefix-condition scheme
$s_j \to w_j \in A^{\ell_j}$, $j = 1, \ldots, m$, which minimizes $\bar{\ell} = \sum_{j=1}^{m} f_j \ell_j$ is an $n$-ary Huffman
sequence. (Why? In fact, the converse is true, as well: every $n$-ary Huffman
sequence is the sequence of code word lengths in a prefix-condition scheme for
$S \to A$ that minimizes $\bar{\ell}$ with respect to some sequence $f_1, \ldots, f_m$ of relative
source frequencies. See Exercise 4.3.6. But we will not need this fact here.)

By Kraft’s Theorem (Theorem 4.2.6), a sequence $\ell_1 \le \cdots \le \ell_m$ of positive
integers is an $n$-ary Huffman sequence if and only if it is minimal with respect
to satisfying Kraft’s Inequality, $\sum_{j=1}^{m} n^{-\ell_j} \le 1$. Since diminishing the largest
of the $\ell_j$ increases the sum $\sum_{j=1}^{m} n^{-\ell_j}$ the least, it follows that $\ell_1 \le \cdots \le \ell_m$ is
an $n$-ary Huffman sequence if and only if

$$\sum_{j=1}^{m} n^{-\ell_j} \le 1 < \sum_{j=1}^{m-1} n^{-\ell_j} + n^{1-\ell_m}.$$

Therefore $\sum_{j=1}^{m} n^{-\ell_j} = 1$ for any positive integers $\ell_1, \ldots, \ell_m$ implies that their
non-decreasing rearrangement is $n$-ary Huffman.


4.3.8 Proposition If $1 \le \ell_1 \le \cdots \le \ell_m$ and $n \ge 2$ are integers and
$\sum_{j=1}^{m} n^{-\ell_j} = 1$, then $m = n + k(n-1)$ for some non-negative integer $k$, and
$\ell_{m-n+1} = \cdots = \ell_m$.

Proof: We go by induction on $L = \ell_m$. If $L = 1$ then $\sum_{j=1}^{m} n^{-\ell_j} = 1$ implies
$m = n$ ($k = 0$) and $\ell_1 = \cdots = \ell_m = 1$.

Suppose that $L = \ell_m > 1$ and $\sum_{j=1}^{m} n^{-\ell_j} = 1$. If $K$ is the number of $i$
such that $\ell_i = L$, then $1 = \sum_{j=1}^{m-K} n^{-\ell_j} + K/n^L$; solving for $K$ shows that
$K$ is a multiple of $n$. Since $K \ge 1$, this establishes the last conclusion of the
proposition, that $\ell_i = L$ for the last $n$ values of $i$. It remains to be shown that
$m = n + k(n-1)$.

Set $K = an$ and set $m' = m - an$, the number of indices $i$ such that
$\ell_i \le L - 1$. We have

$$\sum_{j=1}^{m'} n^{-\ell_j} + a\, n^{-(L-1)} = \sum_{j=1}^{m'} n^{-\ell_j} + \frac{an}{n^L} = 1.$$

By the induction hypothesis, $m' + a = n + k'(n-1)$ for some non-negative
integer $k'$. Thus $m = m' + an = n + (k' + a)(n-1)$, which has the desired form. □


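4.3.9 Proposition If $1 \le \ell_1 \le \cdots \le \ell_m = L$ is an $n$-ary Huffman sequence and
$m = n + k(n-1) + t$, where $k$ is a non-negative integer and $1 \le t \le n-1$, then
$\sum_{j=1}^{m} n^{-\ell_j} + (n-1-t)\, n^{-L} = 1$ and $\ell_{m-t} = \cdots = \ell_m = L$.


Proof: The second conclusion follows from the first and Proposition 4.3.8,
applied to the longer sequence $\ell_1 \le \cdots \le \ell_{m+n-1-t} = L$.

Since $\ell_1 \le \cdots \le \ell_m = L$ is an $n$-ary Huffman sequence, $\sum_{j=1}^{m} n^{-\ell_j} \le 1$ and
clearly $\sum_{j=1}^{m} n^{-\ell_j}$ is an integer multiple of $n^{-L}$. Therefore, for some
non-negative integer $r$, $\sum_{j=1}^{m} n^{-\ell_j} + r\, n^{-L} = 1$.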

© 2003 by CRC Press LLC 




If $r \ge n-1$ then $\sum_{j=1}^{m-1} n^{-\ell_j} + n^{1-L} = \sum_{j=1}^{m} n^{-\ell_j} + (n-1)\, n^{-L} \le \sum_{j=1}^{m} n^{-\ell_j} + r\, n^{-L} = 1$,
contradicting that $\ell_1 \le \cdots \le \ell_m$ is an $n$-ary Huffman sequence. Therefore
$0 \le r \le n-2$.

By Proposition 4.3.8, $m + r = n + k'(n-1)$ for some non-negative integer
$k'$. Thus $m = n + k(n-1) + t = n + k'(n-1) - r = n + (k'-1)(n-1) + (n-1-r)$.
Since both $t$ and $n-1-r$ are among $1, \ldots, n-1$, and $t \equiv n-1-r \bmod (n-1)$,
it follows that $t = n-1-r$, so $r = n-1-t$, as desired. □


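4.3.10 Corollary If $1 \le \ell_1 \le \cdots \le \ell_m = L$ is an $n$-ary Huffman sequence,
where $m = n + k(n-1) + t$ for integers $k \ge 0$ and $1 \le t \le n-1$, then so is the
non-decreasing rearrangement of $\ell'_1, \ldots, \ell'_{m-t}$, where $\ell'_j = \ell_j$, $1 \le j < m-t$,
and $\ell'_{m-t} = L - 1$.


Proof: By Proposition 4.3.9,

$$\sum_{j=1}^{m-t} n^{-\ell'_j} = \sum_{j=1}^{m-t-1} n^{-\ell_j} + \frac{n}{n^L}
= \sum_{j=1}^{m-t-1} n^{-\ell_j} + \frac{t+1}{n^L} + \frac{n-t-1}{n^L}
= \sum_{j=1}^{m} n^{-\ell_j} + \frac{n-t-1}{n^L} = 1.$$ □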

Corollary 4.3.10 allows us to provide a relatively easy proof by induction
on $m$ that if $f_1 \ge \cdots \ge f_m > 0$, $\sum_{j=1}^{m} f_j = 1$, and integers $\ell_1 \le \cdots \le \ell_m$
satisfying $\sum_{j=1}^{m} n^{-\ell_j} \le 1$ minimize $\sum_{j=1}^{m} f_j \ell_j$, then some instance of Huffman’s
algorithm applied to $f_1, \ldots, f_m$ with respect to a code alphabet $A$ with $n$ letters
will produce an encoding scheme with code word lengths $\ell_1, \ldots, \ell_m$. In these
circumstances, if $m \le n$ we must have $\ell_1 = \cdots = \ell_m = 1$ and Huffman’s
algorithm trivially gives the desired result. So suppose that $m = n + k(n-1) + t$
for some integers $k \ge 0$ and $t \in \{1, \ldots, n-1\}$; we go by induction on $m$. Note
that although there may well be different instances of Huffman’s algorithm
applicable to $f_1, \ldots, f_m$, based on different merging choices in the “merge” part
of the algorithm, the first merge will invariably merge the $t+1$ source letters
$s_{m-t}, \ldots, s_m$ into a letter $\sigma$, which will be given relative frequency $\sum_{j=m-t}^{m} f_j$.

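Let $L = \ell_m$. By the previous observation that $\ell_1 \le \cdots \le \ell_m$ is an $n$-ary
Huffman sequence and Proposition 4.3.9, we have that $\ell_{m-t} = \cdots = \ell_m = L$,
and by Corollary 4.3.10, the non-decreasing rearrangement of $\ell_1, \ldots, \ell_{m-t-1}$,
$L - 1$ is an $n$-ary Huffman sequence. We verify that these code word lengths
minimize the average code word length of a possible prefix-condition code
for $S' = \{s_1, \ldots, s_{m-t-1}, \sigma\} \to A$ with respect to the relative frequencies
$f'_1 = f_1, \ldots, f'_{m-t-1} = f_{m-t-1}$, $f'_{m-t} = \sum_{j=m-t}^{m} f_j$. Suppose that $\ell'_1, \ldots, \ell'_{m-t}$ are
positive integers such that $\sum_{j=1}^{m-t} n^{-\ell'_j} \le 1$ and

$$\sum_{j=1}^{m-t} f'_j \ell'_j = \sum_{j=1}^{m-t-1} f_j \ell'_j + \ell'_{m-t} \sum_{j=m-t}^{m} f_j
< \sum_{j=1}^{m-t-1} f_j \ell_j + (L-1) \sum_{j=m-t}^{m} f_j. \qquad (*)$$

We have

$$\sum_{j=1}^{m-t-1} n^{-\ell'_j} + (t+1)\, n^{-(\ell'_{m-t}+1)} = \sum_{j=1}^{m-t-1} n^{-\ell'_j} + \frac{t+1}{n}\, n^{-\ell'_{m-t}}
\le \sum_{j=1}^{m-t} n^{-\ell'_j} \le 1,$$

showing that there is a prefix-condition encoding scheme
for $S \to A$ with code word lengths $\ell'_1, \ldots, \ell'_{m-t-1}, \ell'_{m-t}+1, \ldots, \ell'_{m-t}+1$
(with $\ell'_{m-t}+1$ repeated $t+1$ times). But

$$\sum_{j=1}^{m-t-1} f_j \ell'_j + (\ell'_{m-t}+1) \sum_{j=m-t}^{m} f_j < \sum_{j=1}^{m-t-1} f_j \ell_j + L \sum_{j=m-t}^{m} f_j = \sum_{j=1}^{m} f_j \ell_j$$

(by (*)), contradicting the assumed minimality of $\sum_{j=1}^{m} f_j \ell_j$.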

By the induction hypothesis, there is an instance of Huffman’s algorithm
resulting in a prefix-condition scheme for $S' \to A$ with code word lengths
$\ell_1, \ldots, \ell_{m-t-1}, L-1$ for $s_1, \ldots, s_{m-t-1}, \sigma$, respectively. Let $u$ denote the word
of length $L - 1$ assigned to $\sigma$ in this encoding scheme. Then $u a_1, \ldots, u a_{t+1}$
will be the words of length $L = \ell_{m-t} = \cdots = \ell_m$ assigned to $s_{m-t}, \ldots, s_m$ in the
scheme obtained by the instance of Huffman’s algorithm consisting of preceding
that for $S' \to A$ by merging $s_{m-t}, \ldots, s_m$. Thus some instance of Huffman’s
algorithm results in an encoding scheme for $S \to A$ with code word lengths
$\ell_1, \ldots, \ell_m$.

It remains to show that every instance of Huffman’s algorithm produces an
optimal encoding scheme, with respect to the given source frequencies. In view
of what has already been shown, this task amounts to showing that different
instances of Huffman’s algorithm applied to relative source frequencies
$f_1 \ge \cdots \ge f_m$ result in schemes with the same average code word length. We leave
the details of this demonstration to the reader. Go by induction on $m$, and use
the observation that if $m = n + k(n-1) + t$, $k \ge 0$, $1 \le t \le n-1$, then every
instance of Huffman’s algorithm applied to $f_1 \ge \cdots \ge f_m$, up to switching the
order of source letters with equal source frequencies, starts with the merging of
$s_{m-t}, \ldots, s_m$ into a new letter with relative frequency $\sum_{j=m-t}^{m} f_j$.


Exercises 4.3 (continued) 

7. Suppose that $m = n + k(n-1) + t$, $k \ge 0$, $1 \le t \le n-1$, $1 \le \ell_1 \le \cdots \le
\ell_m = L$ are integers, and $G = \sum_{j=1}^{m} n^{-\ell_j}$. Show that $\ell_1, \ldots, \ell_m$ is an $n$-ary
Huffman sequence if and only if $G + \frac{n-1-t}{n^L} \le 1 < G + \frac{n-1}{n^L}$. (Note
that Proposition 4.3.9 can be used for part of the proof, and provides the
funny corollary that if the two inequalities above hold, then the leftmost
is equality.)

4.4 Optimizing the input frequencies


As before, let $S = \{s_1, \ldots, s_m\}$ be the source alphabet, and $A = \{a_1, \ldots, a_n\}$ the
code alphabet. $A$ is also the input alphabet of the channel we plan to use. Suppose
that the (relative) source frequencies $f_1, \ldots, f_m$ are known, and also the
optimal channel input frequencies $\hat{p}_1, \ldots, \hat{p}_n$ of the input letters $a_1, \ldots, a_n$. We
have the problem of coming up with a “good” encoding scheme, $s_j \to w_j \in
A^+$, $j = 1, \ldots, m$. The goodness of the scheme is judged with reference to
a number of criteria. We have already seen that for unique decodability, we
may as well have a scheme that satisfies the prefix condition. For minimizing
$\bar{\ell} = \sum_{j=1}^{m} f_j \operatorname{lgth}(w_j)$, we have Huffman’s algorithm. Now let us consider
the requirement that the input frequencies $p_1, \ldots, p_n$ of the letters $a_1, \ldots, a_n$
should be as close as possible, in some sense, to the optimal input frequencies
$\hat{p}_1, \ldots, \hat{p}_n$.


In particular circumstances we can wrangle over the metric, the sense 
of “closeness,” to be used, and we can debate the rank of this requirement 
among the various contending requirements, but it is clear that we will make 
no progress toward satisfying this requirement if we cannot compute the input 
frequencies $p_1, \ldots, p_n$ arising from a particular encoding scheme. This
computation is the subject of the following theorem.


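4.4.1 Theorem Suppose that $s_j \to w_j \in A^+$, $j = 1, \ldots, m$, is an encoding
scheme. Suppose that $a_i$ occurs exactly $u_{ij}$ times in $w_j$, $i = 1, \ldots, n$,
$j = 1, \ldots, m$. Then, for $i = 1, \ldots, n$,

$$p_i = \frac{\sum_{j=1}^{m} u_{ij} f_j}{\sum_{j=1}^{m} f_j \operatorname{lgth}(w_j)} = \frac{1}{\bar{\ell}} \sum_{j=1}^{m} u_{ij} f_j.$$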

Proof: We will have a rather informal proof; logicians and philosophers can be 
hired later to dignify it. 

Suppose we have a block of source text with a large number N of source 
characters, with the marvelous property that, for each j = 1,...m, sj; occurs 
exactly the expected number of times, N fj. After encoding, the total number 
of characters in the code text is Vai (N fj) lgth(w;) = N&. The number of 
occurrences of a; in the code text is )""""_; wij (Nfj) = N D0 wij fj- Dividing, 
we find that the proportion of a;’s in the code text is 


N i 
p= tS ets -(- Lot Oo 


4.4.2 Example Suppose that $S = \{a, b, c\}$, $A = \{0, 1\}$ and $f_a = .6$, $f_b = .3$, and
$f_c = .1$. Suppose that the encoding scheme is

a → 00
b → 101
c → 010.


This scheme does not minimize average code word length, but it may have 
compensatory virtues that suit the situation. Letting the alphabet characters 
serve as indices, we have 


$u_{0a} = 2$, $u_{0b} = 1$, $u_{0c} = 2$,
and $u_{1a} = 0$, $u_{1b} = 2$, $u_{1c} = 1$.


Thus the input frequencies will be

$$p_0 = \frac{2(.6) + 1(.3) + 2(.1)}{2(.6) + 3(.3) + 3(.1)} = \frac{17}{24}$$

and $p_1 = 1 - p_0 = 7/24$. If the channel involved is a binary symmetric channel,
then the optimal input frequencies are $\hat{p}_0 = \hat{p}_1 = 1/2$ (see 3.4.5), so $p_0$ and $p_1$
here are quite far from optimal.

The code designer may have had good reasons for the choice of this scheme. 
Would the designer agree to changing the second digit in each of the code words 
of the scheme? This would not change any lengths, nor the relationships among 
the code words. (You might ponder what “relationships” means here.) The new 
scheme: 


a → 01
b → 111
c → 000.


The new input frequencies: $p_0 = 3/8$, $p_1 = 5/8$. These are not optimal, but
they are closer to 1/2 than were the former input frequencies, 17/24 and 7/24. 
If the new scheme is as good as the original in every other respect, then we may 
as well use the new scheme. 
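
The computation of Theorem 4.4.1 is easily mechanized. The following Python sketch is an illustration in which the function name and the dictionary representation of a scheme are our own choices. It recomputes the input frequencies of the two schemes of Example 4.4.2 with exact fractions.

```python
from fractions import Fraction


def input_frequencies(scheme, freqs, alphabet):
    """Return {code letter: p_i} as in Theorem 4.4.1.

    scheme maps source letters to code words; freqs maps source letters
    to relative source frequencies.
    """
    avg_len = sum(freqs[s] * len(w) for s, w in scheme.items())
    return {a: sum(freqs[s] * w.count(a) for s, w in scheme.items()) / avg_len
            for a in alphabet}


freqs = {"a": Fraction(6, 10), "b": Fraction(3, 10), "c": Fraction(1, 10)}
original = {"a": "00", "b": "101", "c": "010"}
modified = {"a": "01", "b": "111", "c": "000"}  # second digit of each word changed

print(input_frequencies(original, freqs, "01"))
print(input_frequencies(modified, freqs, "01"))
```

Here `w.count(a)` plays the role of $u_{ij}$, the number of occurrences of the code letter $a_i$ in the code word $w_j$.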


Optimizing the input frequencies, after minimizing $\bar{\ell}$, with a prefix-condition
code

4.4.3 Problem The input consists of $S$, $A$, the source frequencies $f_1, \ldots, f_m$,
and the optimal input frequencies $\hat{p}_1, \ldots, \hat{p}_n$ for the channel of which $A$ is the
input alphabet. The output is to be an encoding scheme $s_j \to w_j \in A^+$,
$j = 1, \ldots, m$, such that
(i) the prefix condition is satisfied;
(ii) $\bar{\ell} = \sum_{j=1}^{m} f_j \operatorname{lgth}(w_j)$ is minimal, among average code word lengths of
schemes satisfying (i); and
(iii) the $n$-tuple $(p_1, \ldots, p_n)$, computed as in Theorem 4.4.1, is as close as possible
to $(\hat{p}_1, \ldots, \hat{p}_n)$, by some previously agreed upon measure of closeness.
If $d(p, \hat{p})$ denotes the distance from $p = (p_1, \ldots, p_n)$ to $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_n)$,
this means that $d(p, \hat{p})$ is to be minimal among all such numbers
computed from schemes satisfying (i) and (ii).

4.4.4 Usually, $d(p, \hat{p}) = \sum_{j=1}^{n} (p_j - \hat{p}_j)^2$, but you can take

$$d(p, \hat{p}) = \sum_{j=1}^{n} |p_j - \hat{p}_j|^{\alpha}$$

for some power $\alpha$ other than 2, or $d(p, \hat{p}) = \max_{1 \le j \le n} |p_j - \hat{p}_j|$. When $n = 2$,
these different measures of distance are equivalent: for any choice of $d$, above,
$d(p, \hat{p}) \le d(p', \hat{p})$ if and only if $|p_1 - \hat{p}_1| \le |p'_1 - \hat{p}_1|$. See Exercise 4.4.4.


It would be nice to have a slick algorithm to solve Problem 4.4.3, especially
in the case $n = 2$, when the output will not vary with different reasonable
definitions of $d(p, \hat{p})$. Also, the case $n = 2$ is distinguished by the fact that binary
channels are in widespread use in the real world.

We have no such good algorithm! Perhaps someone reading this will supply
one some day. However, we do have an algorithm; it’s brutish, but it’s an
algorithm. Here it is: Supposing $f_1 \ge f_2 \ge \cdots \ge f_m$, use Huffman’s algorithm
to find all $n$-ary Huffman sequences $\ell_1 \le \cdots \le \ell_m$ that minimize $\bar{\ell} = \sum_{j=1}^{m} f_j \ell_j$;
for each of these sequences, we find all possible prefix-condition schemes
$s_j \to w_j \in A^{\ell_j}$ and compute $p_i = (\bar{\ell})^{-1} \sum_{j=1}^{m} u_{ij} f_j$, $i = 1, \ldots, n$. We choose the
scheme for which $(p_1, \ldots, p_n)$ is closest to $(\hat{p}_1, \ldots, \hat{p}_n)$.


4.4.5 Example Let’s carry out the brute-force program suggested above in the
easy circumstances of Example 4.4.2, assuming that the channel is a BSC. We
have $S = \{a, b, c\}$, $A = \{0, 1\}$, $f_a = .6$, $f_b = .3$, $f_c = .1$, and $\hat{p}_0 = \hat{p}_1 = 1/2$.
There is only one sequence of code word lengths to consider: $\ell_a = 1$, $\ell_b =
2 = \ell_c$. We have $\bar{\ell} = 1.4$. There are four different prefix-condition schemes to
consider; the two that start with a → 0 are: a → 0, b → 10, c → 11 and a → 0,
b → 11, c → 10. For the first of these, $p_0 = (1.4)^{-1}(.6 + .3) = 9/14$, and, for the
second, $p_0 = (1.4)^{-1}(.6 + .1) = 1/2$. Clearly the second wins! Alternatively,
the scheme a → 1, b → 00, c → 01 gives optimal input frequencies.

With the same $S$, $A$, and source frequencies, if the channel had been so
oddly constructed that $\hat{p}_0 = 1/3$, $\hat{p}_1 = 2/3$, then the optimal scheme of the four
candidates would have been a → 1, b → 01, c → 00.
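
The brute-force search of Example 4.4.5 can be sketched in Python. This is an illustrative sketch under stated assumptions: the function names are ours, the sequence of code word lengths is supplied by hand (the step of finding all optimal length sequences is not included), closeness is measured by the sum of squared differences, and ties are resolved in favor of the first scheme found.

```python
from fractions import Fraction
from itertools import product


def prefix_schemes(lengths, alphabet):
    """Yield every tuple of words with the given lengths satisfying the prefix condition."""
    pools = [["".join(w) for w in product(alphabet, repeat=l)] for l in lengths]
    for words in product(*pools):
        if not any(v.startswith(u)
                   for i, u in enumerate(words)
                   for j, v in enumerate(words) if i != j):
            yield words


def input_frequencies(words, freqs, alphabet):
    """Input frequencies p_i computed as in Theorem 4.4.1."""
    avg_len = sum(f * len(w) for f, w in zip(freqs, words))
    return [sum(f * w.count(a) for f, w in zip(freqs, words)) / avg_len
            for a in alphabet]


def closest_scheme(lengths, freqs, alphabet, target):
    """Scheme whose input frequencies minimize the squared distance to target."""
    def distance(words):
        p = input_frequencies(words, freqs, alphabet)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, target))
    return min(prefix_schemes(lengths, alphabet), key=distance)


freqs = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]   # f_a, f_b, f_c
lengths = [1, 2, 2]
print(list(prefix_schemes(lengths, "01")))                    # the four candidates
print(closest_scheme(lengths, freqs, "01", [Fraction(1, 2), Fraction(1, 2)]))
print(closest_scheme(lengths, freqs, "01", [Fraction(1, 3), Fraction(2, 3)]))
```

The number of candidate schemes grows very rapidly with the code word lengths, which is exactly why the procedure is called brutish.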


4.4.6 Example $S = \{a, b, c, d, e\}$, $A = \{0, 1\}$, $\hat{p}_0 = \hat{p}_1 = 1/2$, $f_e = .35$, $f_a =
.3$, $f_d = .2$, $f_b = .1$, and $f_c = .05$. The sequences $(\ell_e, \ell_a, \ell_d, \ell_b, \ell_c)$ satisfying
$\sum_{j \in S} 2^{-\ell_j} \le 1$ for which $\bar{\ell} = \sum_{j \in S} f_j \ell_j$ is minimal are (1,2,3,4,4) and
(2,2,2,3,3). [See Exercise 4.2.3 and Example 4.3.4.] The value of $\bar{\ell}$ is 2.15.
Both optimal sequences are obtainable from Huffman’s algorithm; the difference
arises from the choice of the ordering of the alphabet obtained from the
second merge.

The optimal schemes in this case are associated with (2,2,2,3,3). (There
are quite a number of schemes to look at, but, taking into account that $\hat{p}_0 =
\hat{p}_1 = 1/2$, the possibilities boil down to only eight or nine essentially different
schemes.) Here is one of the optimal schemes:

e → 01
a → 10
d → 11
b → 000
c → 001.


Verify that $p_0 = 1.05/2.15 = 21/43$, and that this is as close to 1/2 as you can
get in this situation. (Since each $f_j$ is an integer multiple of .05, the numerator
of $p_0 = \left( \sum_{j} u_{0j} f_j \right) / \bar{\ell}$ will be an integer multiple of .05. Thus the closest $p_0$ can be
made to 1/2 is 1.05/2.15 or 1.10/2.15.)

Observe that in this case we can have $p_0 = p_1 = 1/2$ exactly, with a
prefix-condition scheme, if we sacrifice the minimization of $\bar{\ell}$. For instance, the
fixed-length scheme e → 0011, a → 1100, d → 0101, b → 1001, c → 1010 gives
unique decodability and $p_0 = p_1 = 1/2$.


4.4.7 In general, whenever the optimal input frequencies $\hat{p}_1, \ldots, \hat{p}_n$ are rational
numbers, we can achieve exact input frequency optimization, $p_i = \hat{p}_i$,
$i = 1, \ldots, n$, with a uniquely decodable block code; just make $\ell = \operatorname{lgth}(w_j)$,
$j = 1, \ldots, m$, so large that it is possible to find $m$ distinct words $w_1, \ldots, w_m \in A^{\ell}$
such that the proportion of the occurrences of $a_i$ in each is exactly $\hat{p}_i$. And, if
some of the $\hat{p}_i$ are irrational, we can approximate $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_n)$ by a rational
vector $(p_1, \ldots, p_n) = p$ (satisfying $\sum_{i=1}^{n} p_i = 1$, $p_i \ge 0$, $i = 1, \ldots, n$) as
closely as we wish, and then produce a fixed-length scheme from which the $p_i$
arise as the input frequencies of the $a_i$. Thus the variables $p_1, \ldots, p_n$ are truly
“vary-able,” as we promised in Chapter 3, and arrangements can be made in the
code, or input, “language,” so that the relative input frequencies are as close as
desired to optimal.
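
How long must the words of such a block code be? The number of words of length $\ell$ in which $a_i$ occurs exactly $\ell \hat{p}_i$ times is a multinomial coefficient, so the shortest suitable $\ell$ can be found by a simple search. The Python sketch below is an illustration with hypothetical function names of our own. It assumes the target frequencies are given as exact fractions summing to 1, with at least two of them positive (otherwise the search would not terminate), and it requires Python 3.9 or later for `math.lcm`.

```python
from fractions import Fraction
from math import factorial, lcm


def words_with_composition(counts):
    """Number of words in which the i-th code letter occurs counts[i] times."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def shortest_block_length(target, m):
    """Smallest length with at least m words whose letter proportions equal target."""
    step = lcm(*(p.denominator for p in target))  # lengths must make every l * p_i an integer
    length = step
    while words_with_composition([int(p * length) for p in target]) < m:
        length += step
    return length


# Five source letters, binary channel with optimal input frequencies 1/2 and 1/2:
print(shortest_block_length([Fraction(1, 2), Fraction(1, 2)], m=5))
```

For five source letters and $\hat{p}_0 = \hat{p}_1 = 1/2$ the search stops at length 4, in agreement with the fixed-length scheme displayed in Example 4.4.6, whose five words are among the six binary words of length 4 containing two 0's.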

However, the method suggested in the preceding paragraph for approximating
the optimal input frequencies is clearly impractical; the code words would
have to be quite long, so that the rate of processing of source text would be quite 
slow, and increasing that rate is generally reckoned to be of greater consequence 
than the close approximation of the optimal input frequencies. 

In the same vein, one might well question the importance of Problem 4.4.3, 
although in this problem the approximation of the optimal input frequencies is 
subordinated to minimizing $\bar{\ell}$, i.e., to speeding up the processing of source
text. As long as the scheme is uniquely decodable and $\bar{\ell}$ is minimized, why
fiddle with trying to approximate the optimal input frequencies? Optimizing the 
average amount of information conveyed by the channel per input letter, with 
the input stream somewhat artificially regarded as randomly generated, may 
seem an ivory-tower objective, an academic exercise of doubtful connection to 
the real world problem of communicating a source stream through a channel. 

However, it is an indirect and little-noted consequence of the famed Noisy 
Channel Theorem, to be explained in Section 4.6, that there is a connection be- 
tween the practical problems of communication and the problem of encoding 
the source stream so that the input frequencies are approximately optimal. Not 
to go into detail, the import of the NCT is that there exist ways of encoding the
source stream that simultaneously do about as well as can be done regarding 
the two most obvious practical problems of communication: keeping pace with 
the source stream (up to a threshold that depends on the channel capacity), and 
reducing the error frequency, in the reconstitution of the source stream (decod- 
ing) at the receiver of the channel. Although it is not explicitly proven in any of 
the rigorous treatments of the NCT, the role of the channel capacity in the NCT 
strongly argues for the information-theoretic folk theorem that the relative input 
frequencies resulting from those wonderful optimizing coding methods whose 
existence is asserted by the NCT must be nearly optimal, themselves. 

This folk theorem is of particular interest when you realize that all known 
proofs of the NCT are probabilistic existence proofs; there is no good construc- 
tive way known of acquiring those coding methods whose existence is proved. 
Furthermore, when you understand the nature of those methods, you will un- 
derstand that they would be totally impractical, even if found. [The situation 
reminds us of contrived gambling games in which the expected gain per play 
is infinite, yet the probability of going bankrupt due to accumulated losses is 
very close to one, even for Bill Gates.] So the problem of effective coding 
realizing the aspirations expressed in the NCT is still on the agenda, and has 
been for the 54 years (as this is written) since Shannon’s masterpiece [63]. So 
far as we know, the indirect approach of aiming, among other things, to get 
close to the optimal relative input frequencies by astute coding has not been a 
factor in the progress of the past half-century. In part this has to do with the 
fact that binary symmetric channels are the only channels that have been seri- 
ously considered; also, it has been generally assumed that the relative source 
frequencies are equal (see the discussion, next section, on the equivalence of 
Maximum Likelihood Decoding and Nearest Code Word Decoding), and the 
dazzling algebraic methods used to produce great coding and decoding under 
these assumptions automatically produce a sort of uniformity that makes $p_0$ and
$p_1$ equal or trivially close to 1/2. Perhaps the problem of approximating the optimal
input frequencies by astute encoding will become important in the future,
as communication engineering ventures away from the simplifying assumption 
of equal source frequencies. 


Exercises 4.4 


1. We return to 4.3.4: $S = \{a,b,c,d,e\}$, $f_a = 0.3$, $f_b = 0.25$, $f_c = 0.2$,
$f_d = 0.15$, and $f_e = 0.1$. Find a scheme which solves the problem in paragraph
4.4.3 when

(a) $A = \{0,1\}$, $p_0 = 1/2 = p_1$;

(b) $A = \{0,1\}$, $p_0 = 2/3$, $p_1 = 1/3$;

(c) $A = \{0,1,*\}$, $p_0 = p_1 = p_* = 1/3$;

(d) $A = \{0,1,*\}$, $p_0 = p_1 = 2/5$, $p_* = 1/5$.


2. In each of (a)-(d) in the preceding problem, find a uniquely decodable
fixed-length scheme which gives the optimal input frequencies exactly. The 
shorter the length, the better. 


3. When |S| =m = 26, find the shortest length of a fixed-length prefix-condi- 
tion scheme, by which the optimal input frequencies are realized exactly, 
constructed as suggested in 4.4.7, when the code alphabet and optimal input 
frequencies are as in 1(a)-(d), above. [Notice that the method suggested in
4.4.7 takes no account of the source frequencies.] Compare with the $\ell$ you
found in exercise 4.3.2 (a) and (b).


4. Verify the assertion about the case $n = 2$ made in 4.4.4. [Hint: observe that
if $p_1 + p_2 = p_1' + p_2' = p_1'' + p_2''$, then if $|p_1' - p_1| < |p_1'' - p_1|$, it follows
that $|p_2' - p_2| < |p_2'' - p_2|$, since $|p_1' - p_1| = |p_2' - p_2|$ and
$|p_1'' - p_1| = |p_2'' - p_2|$.]

5. Suppose the source text is encoded by the scheme $s_j \to w_j \in A^*$, $j = 1,\dots,m$,
the source frequencies are $f_1,\dots,f_m$, and the $u_{ij}$ are as in Theorem 4.4.1.
We select a letter at random from the source text and look
at it; if it is $s_j$, we then select a letter at random from $w_j$. What is the
probability that $a_i$ will be selected by this procedure? Is this the same as
$(\bar{\ell})^{-1} \sum_{j=1}^{m} u_{ij} f_j$? If not, why not?


6. This exercise concerns the efficiency of the brute-force algorithm suggested
for solving Problem 4.4.3. 


(a) How many prefix-condition binary encoding schemes are there with 
code word lengths 2,2, 3,3, 3,3? 

(b) How many prefix-condition binary encoding schemes are there with 
code word lengths 2,2, 2,3,4,4? 

(c) How many prefix-condition ternary (|A| = 3) encoding schemes are 
there with code word lengths 1,2,2,2,2,2? 

*(d) Given $|A| = n \ge 2$ and positive integers $\ell_1 \le \cdots \le \ell_m$ satisfying
$\sum_{i=1}^{m} n^{-\ell_i} \le 1$, give a formula, in terms of $n$ and $\ell_1,\dots,\ell_m$, for the
number of different prefix-condition encoding schemes for $S \to A$,
$S = \{s_1,\dots,s_m\}$, with code word lengths $\ell_1,\dots,\ell_m$. [See Section 1.6
and the proof of Kraft’s Inequality.]


4.5 Error correction, maximum likelihood decoding,
nearest code word decoding, and reliability

Let $S = \{s_1,\dots,s_m\}$ and $A = \{a_1,\dots,a_n\}$ be as in the preceding sections, and
suppose that $B = \{b_1,\dots,b_k\}$ is the output alphabet of the channel of which
$A$ is the input alphabet, the channel that we plan to use. Let $Q = [q_{ij}]$ be the
matrix of transition probabilities of this channel. Let $f_1,\dots,f_m$ be the (relative)
source frequencies of $s_1,\dots,s_m$, respectively. We shall consider only fixed-length
encoding schemes $s_j \to w_j \in A^\ell$, $j = 1,\dots,m$, with the $w_j$ distinct, as
nature usually demands; so $m \le n^\ell$.

Suppose we are trying to convey some source letter $s_j$ through the channel.
What we really send is the sequence $w_j$ of input characters. The channel does
whatever it does to $w_j$, and what is received is a word $w$ over $B$ of length $\ell$. You
are at the receiving end. You know the encoding scheme. The input string has
been timed and blocked so that you know that the output segment $w$ resulted
from an attempt to transmit one of $w_1,\dots,w_m$. How are you going to guess
which $w_j$ (and thus, which $s_j$) was intended?


4.5.1 We are in a conditional probabilistic situation not unlike that of the man 
who draws balls from an urn in a dark room, and later wonders which urn he 
drew from (see 1.3.2 and exercise 1.3.5). 

Surely the most reasonable choice of $w_j$ is that for which the conditional
probability

$$P(w_j \text{ was sent} \mid w \text{ was received}) = \frac{P(w_j \text{ was sent and } w \text{ was received})}{P(w \text{ was received})}$$

is the greatest; since the denominator does not depend on $j$, it behooves us to
inspect the numbers

$$P(w_j, w) = P(w_j \text{ was sent and } w \text{ was received}).$$

These are calculated as follows. Suppose that $w_j = a_{i(1,j)} \cdots a_{i(\ell,j)}$; that is,
suppose that $a_{i(s,j)} \in A$ is the $s$th code letter in $w_j$, reading left to right. Suppose
that $w = b_{t_1} \cdots b_{t_\ell}$. Then

$$\begin{aligned}
P(w_j, w) &= P(w_j \text{ was sent})\, P(w \text{ was received} \mid w_j \text{ was sent}) \\
&= f_j\, q_{i(1,j),t_1} \cdots q_{i(\ell,j),t_\ell} \\
&= f_j \prod_{s=1}^{\ell} q_{i(s,j),t_s}.
\end{aligned}$$


4.5.2 Example Suppose that $S = \{a,b,c\}$, $A = \{0,1\}$, $B = \{0,1,*\}$, $f_a = .5$,
$f_b = .3$, $f_c = .2$, and the transition probabilities are

$$\begin{bmatrix} q_{00} & q_{01} & q_{0*} \\ q_{10} & q_{11} & q_{1*} \end{bmatrix} = \begin{bmatrix} .9 & .06 & .04 \\ .05 & .92 & .03 \end{bmatrix}.$$

Suppose that the encoding scheme is $a \to 00$, $b \to 11$, and $c \to 01$. Suppose
that $w = 0*$ is received. Then

$$\begin{aligned}
P(00, 0*) &= f_a q_{00} q_{0*} = (.5)(.9)(.04) = .018, \\
P(11, 0*) &= f_b q_{10} q_{1*} = (.3)(.05)(.03) = .00045, \text{ and} \\
P(01, 0*) &= f_c q_{00} q_{1*} = (.2)(.9)(.03) = .0054.
\end{aligned}$$

Of these, .018 is the greatest; having received $0*$, we would bet that 00, the code
word for $a$, was intended.

The practice of decoding received words $w \in B^\ell$ by choosing the $j$ for
which $P(w_j, w)$ is the greatest is called maximum likelihood decoding, or MLD,
for short. [In case there are two values of $j$ for which $P(w_j, w)$ is maximal,
and in case we choose not to decode such $w$, the practice is sometimes called
incomplete maximum likelihood decoding, or IMLD. We will stand by MLD, in
this text, even though we will not, in fact, decode $w$ in case of ties.]
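
The arithmetic of Example 4.5.2 can be checked with a short Python sketch. The dictionary layout for the transition probabilities and the names `joint_probability` and `mld_decode` are illustrative assumptions, not notation from the text; a tie is reported as `None`, mirroring the convention of not decoding.

```python
# Sketch of maximum likelihood decoding for Example 4.5.2.
# Q[sent][received] holds the transition probabilities q_{ij}.
Q = {
    "0": {"0": 0.90, "1": 0.06, "*": 0.04},
    "1": {"0": 0.05, "1": 0.92, "*": 0.03},
}
FREQ = {"a": 0.5, "b": 0.3, "c": 0.2}        # source frequencies f_j
SCHEME = {"a": "00", "b": "11", "c": "01"}   # encoding scheme s_j -> w_j


def joint_probability(source_letter, received):
    """Return P(w_j, w) = f_j times the product of q over the positions."""
    prob = FREQ[source_letter]
    for sent, got in zip(SCHEME[source_letter], received):
        prob *= Q[sent][got]
    return prob


def mld_decode(received):
    """Return the source letter maximizing P(w_j, w), or None on a tie.

    Floats are adequate for this example; exact ties are more safely
    detected with fractions.Fraction.
    """
    scores = {s: joint_probability(s, received) for s in SCHEME}
    best = max(scores.values())
    winners = [s for s, p in scores.items() if p == best]
    return winners[0] if len(winners) == 1 else None


for s in SCHEME:
    print(s, SCHEME[s], joint_probability(s, "0*"))
print("decode 0* as", mld_decode("0*"))
```

Up to floating-point rounding, the three printed values are the three joint probabilities computed in the example.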


The MLD Table In a real situation, we would take care of the drudgery of
computing and comparing the numbers $P(w_j, w)$ before attempting to decode
words received at the receiving end of the channel. The results of this computing
and comparing are collected in an MLD table, which consists of two columns:
on the left we list all $k^\ell = |B^\ell|$ words $w$ that might possibly be received, and
on the right the source letters $s_j$ corresponding to the $w_j$ for which $P(w_j, w)$
is maximal. That is, opposite each possible received word $w \in B^\ell$, we put the
source letter $s$ recommended by MLD for decoding $w$.¹

It should be obvious how the MLD table is to be used at the receiving end
of the channel. It functions as a dictionary, and for that reason the words $w \in B^\ell$
should be arranged in some reasonable lexicographic order for rapid “looking
up.” In practice, the process of looking up $w$ and decoding will be electronic.


4.5.3 Example In the situation of Example 4.5.2, the MLD table (with $B^2$ arranged
in one of the two obvious lexicographic orders) is


(Receive) w | Decode s
     00     |    a
     01     |    c
     0*     |    a
     10     |    a
     11     |    b
     1*     |    b
     *0     |    a
     *1     |    b
     **     |    a


(Verify that this table is correct.) Thus, if $*0$ were received, we would quickly
decode $a$.
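
One way to carry out the suggested verification is to generate the table mechanically. The following Python sketch does so for Example 4.5.2; the function name `build_mld_table` and the data layout are assumptions made for illustration, and exact rational arithmetic is used so that ties (entered as "dnd") would be detected reliably.

```python
from fractions import Fraction as F
from itertools import product

# Data of Example 4.5.2, stored as exact fractions.
Q = {
    "0": {"0": F(90, 100), "1": F(6, 100), "*": F(4, 100)},
    "1": {"0": F(5, 100), "1": F(92, 100), "*": F(3, 100)},
}
FREQ = {"a": F(5, 10), "b": F(3, 10), "c": F(2, 10)}
SCHEME = {"a": "00", "b": "11", "c": "01"}
OUTPUT_ALPHABET = "01*"   # listing order 0, 1, * gives one lexicographic order


def build_mld_table(freq, scheme, q, output_alphabet):
    """Map every possible received word to the MLD choice, or 'dnd' on a tie."""
    length = len(next(iter(scheme.values())))
    table = {}
    for letters in product(output_alphabet, repeat=length):
        w = "".join(letters)
        scores = {}
        for s, codeword in scheme.items():
            p = freq[s]
            for sent, got in zip(codeword, w):
                p *= q[sent][got]
            scores[s] = p          # this is P(w_j, w)
        best = max(scores.values())
        winners = [s for s, p in scores.items() if p == best]
        table[w] = winners[0] if len(winners) == 1 else "dnd"
    return table


MLD_TABLE = build_mld_table(FREQ, SCHEME, Q, OUTPUT_ALPHABET)
for w, s in MLD_TABLE.items():
    print(w, "|", s)
```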

It may come as a surprise that MLD, as described, is virtually never used in 
current practice, except when it coincides with Nearest Code Word Decoding, 
to be described next. 


Definition Suppose that $B$ is an alphabet, and that $u, v \in B^\ell$ for some positive
integer $\ell$. The Hamming distance between $u$ and $v$, denoted $d_H(u, v)$, is the
number of places at which $u$ and $v$ differ.


¹In case there are two or more values of $j$ for which $P(w_j, w)$ is maximal, tie-breaking rules
can be introduced to choose among the candidates $s_j$. These rules might arise from considerations
peculiar to the particular situation. In this text, we will enter “dnd” for “do not decode” in the
decoding column, in case of ties.

For instance, if $B = \{0, 1, *\}$, then $d_H(01101, 111{*}0) = 3$. Verify that $d_H$
is a metric on $B^\ell$; that is, for any $u, v, w \in B^\ell$, $d_H(u,v) = d_H(v,u) \ge 0$, with
equality if and only if $u = v$, and $d_H(u,v) \le d_H(u,w) + d_H(w,v)$.
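
A minimal Python sketch of the Hamming distance follows, together with a brute-force check of the triangle inequality on the nine words of length 2 over {0, 1, *}. The function name `hamming_distance` is an illustrative choice, and the finite check is a sanity test, not a proof of the metric property.

```python
from itertools import product


def hamming_distance(u, v):
    """Number of positions at which the equal-length words u and v differ."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(1 for x, y in zip(u, v) if x != y)


print(hamming_distance("01101", "111*0"))

# Brute-force check of the triangle inequality on all words of length 2.
WORDS = ["".join(t) for t in product("01*", repeat=2)]
TRIANGLE_OK = all(
    hamming_distance(u, v) <= hamming_distance(u, w) + hamming_distance(w, v)
    for u in WORDS for v in WORDS for w in WORDS
)
print(TRIANGLE_OK)
```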


4.5.4 The metric $d_H$ represents one reasonable way of defining the distances
between the words of $B^\ell$. We could jazz things up considerably, and provide serious
mathematicians with hours of fun, by generalizing the Hamming distance
as follows. Let $\delta$ be any metric on $B$, and let $\varphi : [0,\infty)^\ell \to [0,\infty)$ satisfy:


(i) $0 \le x_i \le y_i$, $i = 1,\dots,\ell$ $\Rightarrow$ $\varphi(x_1,\dots,x_\ell) \le \varphi(y_1,\dots,y_\ell)$;
(ii) $\varphi(x + y) \le \varphi(x) + \varphi(y)$, for $x, y \in [0,\infty)^\ell$; and
(iii) $\varphi(x) = 0$ if and only if $x = 0 \in [0,\infty)^\ell$.

Then the generalized Hamming distance between words $u = u_1 \cdots u_\ell \in B^\ell$ and
$v = v_1 \cdots v_\ell \in B^\ell$, associated with $\delta$ and $\varphi$, is
$d_{gH}(u,v) = \varphi(\delta(u_1, v_1),\dots,\delta(u_\ell, v_\ell))$.
Observe that $d_{gH} = d_H$ when $\delta$ is the so-called trivial metric defined
by

$$\delta(a,b) = \begin{cases} 1, & \text{if } a \ne b, \\ 0, & \text{if } a = b, \end{cases}$$

and $\varphi$ is defined by $\varphi(x_1,\dots,x_\ell) = \sum_{i=1}^{\ell} x_i$.

This sort of thing is amusing, but is it useful? So far as anyone can tell, at
this stage of history, no. However, see Exercise 4.5.3.
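
The construction in 4.5.4 can be sketched in Python by passing the letter metric and the combining function as arguments. The names `generalized_hamming`, `trivial_metric`, `delta`, and `phi` are illustrative assumptions; the choice of `max` as a second combining function is only an example of a function satisfying (i)-(iii), not one taken from the text.

```python
def trivial_metric(a, b):
    """The trivial metric on the alphabet: 0 if the letters agree, else 1."""
    return 0 if a == b else 1


def generalized_hamming(u, v, delta, phi):
    """Apply phi to the tuple of letter-by-letter distances delta(u_i, v_i)."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return phi([delta(a, b) for a, b in zip(u, v)])


# With the trivial metric and phi = sum, the ordinary Hamming distance results.
print(generalized_hamming("01101", "111*0", trivial_metric, sum))

# phi = max is monotone, subadditive, and vanishes only at 0, so it also
# satisfies conditions (i)-(iii); it yields a different metric.
print(generalized_hamming("01101", "111*0", trivial_metric, max))
```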


Nearest Code Word Decoding (NCWD) Suppose that $A \subseteq B$, and we have a
fixed-length encoding scheme $s_j \to w_j \in A^\ell$. In NCWD, having received $w \in B^\ell$,
we decode $w$ as that $s_j$ (if there is exactly one such) for which $d_H(w_j, w)$
is least. If there are two or more $j$ such that $d_H(w_j, w)$ is minimal, we do not
decode.

For example, if $S = \{a,b,c\}$, $A = B = \{0, 1\}$, and the encoding scheme is
$a \to 000$, $b \to 111$, $c \to 010$, then if 100 is received, we decode $a$ in NCWD.
If 011 is received, we do not decode in NCWD, because both 111 and 010 are a
(minimal) distance one from 011.
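
The NCWD rule just described can be sketched as follows. The function names are illustrative assumptions, and `None` stands for "do not decode".

```python
def hamming_distance(u, v):
    """Number of positions at which the equal-length words u and v differ."""
    return sum(1 for x, y in zip(u, v) if x != y)


def ncwd_decode(received, scheme):
    """Return the source letter whose code word is uniquely nearest, else None."""
    distances = {s: hamming_distance(cw, received) for s, cw in scheme.items()}
    least = min(distances.values())
    nearest = [s for s, d in distances.items() if d == least]
    return nearest[0] if len(nearest) == 1 else None


SCHEME = {"a": "000", "b": "111", "c": "010"}
print(ncwd_decode("100", SCHEME))   # a single nearest code word exists
print(ncwd_decode("011", SCHEME))   # 111 and 010 are both at distance one
```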

NCWD is a much easier decoding method than MLD. It is not just that 
it is easier to make up a decoding table or dictionary under the “nearest code 
word” criterion, than under the “most likely to have been sent” measure; the 
act of comparing words to determine their Hamming distance apart is so simply 
algebraic that it lends itself to slick, fast decoding algorithms that run much 
faster than the “looking up w on a list” method that we use for MLD. Instances 
of clever choices of $w_1,\dots,w_m$ leading to clever NCWD algorithms are beyond
the scope of this course, but they form the majority of the subject matter of 
advanced algebraic coding theory (see [30]). 

But the simplicity of NCWD comes at a price; NCWD ignores the transi- 
tion probabilities and the source frequencies, and is therefore possibly unreli- 
able. See Exercise 4.5.2(a), and note the difference between NCWD and MLD 
in this case. What if the infrequently transmitted source message were “launch
the missiles” or some such apocalyptic command? Pretend that you are the
manager in charge of communications at a missile silo emplacement in North 
Dakota, and the source frequencies are not, in fact, known a priori, as is often 
the case. Pretend that you have a Master’s Degree in Applied Mathematics from 
a large southern state university, and have taken a couple of courses in coding 
theory that get into sophisticated codes and decoding algorithms that are, in fact, 
NCWD. Will you “use your education” by resorting to some satisfyingly fancy 
form of NCWD for transmissions to and from the missile silos, for messages 
ranging from “Fred, please pick up a bunch of parsley and a pound of Vaseline 
on the way home” to “Arm and launch immediately!”? Let’s hope not. The 
scene is fanciful, but there is an important point, which we hope you get. 

So, when can you depend on NCWD? The following theorem gives a suf- 
ficient condition. 


4.5.5 Theorem Suppose that $A = B$, that the relative source frequencies are
equal (to $1/m$), that $q_{ii} = q > 1/n$, $i = 1,\dots,n$, and that the off-diagonal transition
probabilities $q_{ij}$, $i \ne j$, are equal. Then NCWD and MLD are the same,
whatever the fixed-length encoding scheme.


Proof: Under the hypotheses, $q_{ij} = (1 - q)/(n - 1)$ for $i \ne j$. For any scheme
$s_j \to w_j \in A^\ell$, $j = 1,\dots,m$, and any $w \in B^\ell = A^\ell$,

$$P(w_j, w) = \frac{1}{m}\, q^{\ell - d} \left(\frac{1-q}{n-1}\right)^{d} = \frac{q^{\ell}}{m} \left(\frac{1-q}{q(n-1)}\right)^{d},$$

where $d = d_H(w_j, w)$. Since $1/n < q \le 1$, it follows that $0 \le \frac{1-q}{q(n-1)} < 1$, and
thus $P(w_j, w)$ is a decreasing function of $d = d_H(w_j, w)$. If $q = 1$, the channel
is perfect, and NCWD and MLD coincide trivially. If $q < 1$, then $P(w_j, w)$ is
a strictly decreasing function of $d = d_H(w_j, w)$; consequently, the unique $j$, if
any, for which $d_H(w_j, w)$ is minimal, is also the unique $j$ for which $P(w_j, w)$
is the greatest. □


4.5.6 Corollary When the channel is a binary symmetric channel, with reli- 
ability greater than 1/2, and the relative source frequencies are equal, then 
NCWD and MLD are the same. 
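
The corollary can be spot-checked numerically. The sketch below compares the two decoders on every received word of length 3 for one binary symmetric channel with equal source frequencies; the scheme, the reliability 9/10, and all function names are assumptions chosen for illustration. Exact fractions are used so that ties are detected identically by both methods. A check of one example is of course not a proof.

```python
from fractions import Fraction
from itertools import product


def hamming_distance(u, v):
    return sum(1 for x, y in zip(u, v) if x != y)


def ncwd_decode(received, scheme):
    """Uniquely nearest code word, or None on a tie."""
    distances = {s: hamming_distance(cw, received) for s, cw in scheme.items()}
    least = min(distances.values())
    nearest = [s for s, d in distances.items() if d == least]
    return nearest[0] if len(nearest) == 1 else None


def mld_decode(received, scheme, freq, q, n):
    """MLD on an n-ary symmetric channel with reliability q, or None on a tie."""
    off_diagonal = (1 - q) / (n - 1)
    scores = {}
    for s, codeword in scheme.items():
        p = freq[s]
        for sent, got in zip(codeword, received):
            p *= q if sent == got else off_diagonal
        scores[s] = p
    best = max(scores.values())
    winners = [s for s, p in scores.items() if p == best]
    return winners[0] if len(winners) == 1 else None


SCHEME = {"a": "000", "b": "111", "c": "010"}
FREQ = {s: Fraction(1, len(SCHEME)) for s in SCHEME}   # equal frequencies
RELIABILITY = Fraction(9, 10)                          # greater than 1/2

AGREE = all(
    mld_decode("".join(w), SCHEME, FREQ, RELIABILITY, 2)
    == ncwd_decode("".join(w), SCHEME)
    for w in product("01", repeat=3)
)
print(AGREE)
```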


4.5.7 The hypothesis of Theorem 4.5.5, regarding the transition probabilities, 
says that the channel is an n-ary symmetric channel with reliability $q > 1/n$
(see section 3.4). It is quite common to know, or strongly suspect, a channel to 
be n-ary symmetric without knowing the reliability. 


The hypothesis of Theorem 4.5.5 regarding the source frequencies raises 
a philosophical question about probability: if the source frequencies are not 
known a priori, should we take them to be equal? Clearly whatever knowledge 
we have should affect our estimates of probability—for instance, although we 
do not know for sure that the sun will rise tomorrow, it would be rash to assign
probability 1/2 to the possibility that it will not. So we are in delicate
circumstances when we have a binary symmetric or, more generally, an n-ary
symmetric channel and a data communication problem in which we do not know 
beforehand how frequently the various data are likely to be transmitted. Should 
we attempt to complicate the decoding process by taking into account our sense 
of likely bias in the frequencies of the source messages? And, if so, how? 

The usual answer is to ignore the problem and to resort to NCWD. The 
usual theory of binary block codes starts from the implicit assumption that the 
relative source frequencies are equal; indeed, the source alphabet and the en- 
coding scheme are not elements of the theory, nor even mentioned. We mention 
this as a caution to future appliers of coding theory. 

It can be persuasively argued that the alleged weakness of NCWD—that it 
leaves the source frequencies and the transition probabilities out of account—is 
actually a practical strength. The argument rests, not on the ease of NCWD, but 
on the undebatable fact that in many situations the source frequencies and the 
transition probabilities are fictional quantities, unknown and unknowable. 


Reliability and Error Given a source, a channel, a way of encoding the source 
stream into a string of channel input letters, and a method of decoding the output 
at the channel receiver, so that for each letter appearing in the original source 
stream, some source letter will appear in its place in the hopefully resurrected 
source letter stream emerging from the decoder: the reliability R of the given 
code-and-channel system is the probability that a letter randomly selected from 
the source stream will be decoded correctly at the receiver—i.e., R is the proba- 
bility that a randomly selected letter from the source stream will be successfully 
communicated by the code-and-channel system. 

The (average) error probability of the code-and-channel system is $E = 1 - R$.
The maximum error probability of the system, denoted $\hat{E}$, is the maximum,
over the source letters $s$, of the probability of an error at a randomly selected
spot in the source stream, supposing that $s$ occupied that spot in the original
stream.

We will calculate $R$, $E$, and $\hat{E}$ in the circumstances that allow full-fledged
MLD. That is, suppose we are given $S$, $f_1,\dots,f_m$, $A$, $B$, $Q$, and a fixed-length
encoding scheme $s_j \to w_j \in A^\ell$, $j = 1,\dots,m$, for $S \to A$.

Given an MLD table for the code-and-channel system, the reliability $R$ can
be calculated as follows. For each $j = 1,\dots,m$, let

$$N_j = \{w \in B^\ell : \text{MLD decodes } w \text{ as } s_j\}.$$


The $N_j$ can be read off from the MLD table; $N_j$ consists of those $w$ in the
left-hand column of the table opposite the occurrences of $s_j$ in the right. Then

$$\begin{aligned}
R &= \sum_{j=1}^{m} P(s_j \text{ is selected for transmission, and the word } w \text{ received lies in } N_j) \\
&= \sum_{j=1}^{m} f_j\, P(\text{transmission of } w_j \text{ results in a received word } w \in N_j) \\
&= \sum_{j=1}^{m} f_j \sum_{w \in N_j} P(w \text{ was received} \mid w_j \text{ was sent}) \\
&= \sum_{j=1}^{m} f_j \sum_{w \in N_j} P(w \mid w_j),
\end{aligned}$$

where, as noted in 4.5.1, $P(w \mid w_j) = \prod_{s=1}^{\ell} q_{i(s,j),t_s}$ when $w_j = a_{i(1,j)} \cdots a_{i(\ell,j)}$
and $w = b_{t_1} \cdots b_{t_\ell}$.


Example In Example 4.5.3, we have $N_a = \{00, 0*, 10, *0, **\}$, $N_b = \{11, 1*, *1\}$,
and $N_c = \{01\}$. Thus

$$\begin{aligned}
R &= .5[(.9)^2 + (.9)(.04) + (.06)(.9) + (.04)(.9) + (.04)^2] \\
&\quad + .3[(.92)^2 + (.92)(.03) + (.03)(.92)] + .2[(.9)(.92)] \\
&= .90488,
\end{aligned}$$

and $E = 1 - R = 0.09512$.


The maximum error probability $\hat{E}$ is calculated as follows:

$$\begin{aligned}
\hat{E} &= \max_{1 \le j \le m} P(\text{incorrect decoding} \mid s_j \text{ was intended}) \\
&= \max_{1 \le j \le m} P(\text{the received } w \in B^\ell \text{ does not lie in } N_j \mid w_j \text{ was transmitted}) \\
&= 1 - \min_{1 \le j \le m} P(w \in N_j \mid w_j \text{ was transmitted}).
\end{aligned}$$

Notice that, for each $j$, $P(w \in N_j \mid w_j \text{ was transmitted})$ is the quantity multiplied
by $f_j$ in the expression for $R$, above. For instance, again referring to
the circumstances of Examples 4.5.2 and 4.5.3, we calculate that $c$ has the least
likelihood, 0.828, of being correctly transmitted, and thus, for that code and
channel, $\hat{E} = 1 - .828 = .172$.

Observe that $\hat{E}$ does not take the source frequencies into account (although
they do enter anyway, in the construction of the MLD table). It is a “worst-case”
sort of measure of error likelihood.
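
The reliability, the average error probability, and the maximum error probability for Examples 4.5.2 and 4.5.3 can be recomputed with the following Python sketch. The variable names (`N`, `CORRECT`, `E_MAX`) and the data layout are illustrative assumptions; received words on which MLD ties would belong to no set and so would count as errors, in line with the definitions above.

```python
from fractions import Fraction as F
from itertools import product

# Data of Example 4.5.2, stored as exact fractions.
Q = {
    "0": {"0": F(90, 100), "1": F(6, 100), "*": F(4, 100)},
    "1": {"0": F(5, 100), "1": F(92, 100), "*": F(3, 100)},
}
FREQ = {"a": F(5, 10), "b": F(3, 10), "c": F(2, 10)}
SCHEME = {"a": "00", "b": "11", "c": "01"}


def conditional_probability(codeword, received):
    """P(w | w_j): product of the transition probabilities, letter by letter."""
    p = F(1)
    for sent, got in zip(codeword, received):
        p *= Q[sent][got]
    return p


def mld_decode(received):
    """Source letter maximizing f_j * P(w | w_j), or None on a tie."""
    scores = {s: FREQ[s] * conditional_probability(cw, received)
              for s, cw in SCHEME.items()}
    best = max(scores.values())
    winners = [s for s, p in scores.items() if p == best]
    return winners[0] if len(winners) == 1 else None


WORDS = ["".join(t) for t in product("01*", repeat=2)]

# N[s] is the set of received words that MLD decodes as s.
N = {s: [w for w in WORDS if mld_decode(w) == s] for s in SCHEME}

# CORRECT[s] = P(w in N_s | w_s was transmitted).
CORRECT = {s: sum((conditional_probability(SCHEME[s], w) for w in N[s]), F(0))
           for s in SCHEME}

R = sum(FREQ[s] * CORRECT[s] for s in SCHEME)   # reliability
E = 1 - R                                       # average error probability
E_MAX = 1 - min(CORRECT.values())               # maximum error probability

print(float(R), float(E), float(E_MAX))
```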


4.5.8 For many code-and-channel systems we have “do not decode” occurring
in the right hand column (the decode column) of the MLD table, corresponding
to words $w \in B^\ell$ for which there are two or more $j$ for which $P(w_j, w)$ is
maximal. With such a table, we have a number of choices to make in assessing
the likelihood of error. Should a “do not decode” message, which surely signals
some sort of failure of the system, weigh as much in our estimation as an out-and-out
error, in which we decode the wrong source message from the received
word $w$? In the definitions of $E$ and $\hat{E}$, above, the two different sorts of error are
treated as the same, but most people would agree that in most situations the “do
not decode” message is a less serious sort of error than an incorrect decoding of
which we are unaware.

There are an endless number of ways of weighting the significance of the 
various errors that could occur, using any particular code-and-channel system. 
You should be aware that the definitions of $E$ and $\hat{E}$ given here are not graven
in stone, and that in the real world you might do well to fashion a measure of 
error likelihood appropriate to the real situation, a measure which takes your 
weighting of error significance into account. See the end of this section for 
exercises on error weighting, and on the computation of reliability and error 
when NCWD is used. 


Reliability of a channel Suppose we have a channel with input alphabet $A$,
output alphabet $B$, and transition probabilities $q_{ij}$, $i = 1,\dots,n$, $j = 1,\dots,k$. Let
us adjoin a code by taking $S = A$, $f_j = p_j$, $j = 1,\dots,n$, where $(p_1,\dots,p_n)$ is
some $n$-tuple of optimal input frequencies, and the encoding scheme $a_j \to a_j$,
$j = 1,\dots,n$. The reliability $R$ (with respect to MLD) of the resulting code-and-channel
system will be called the reliability of the channel. (Perhaps the definite
article is not justified here when $(p_1,\dots,p_n)$ is not unique; we pass over this
difficulty for now.)

In the case of an n-ary symmetric channel, one satisfying the hypothesis of
Theorem 4.5.5, we can take $(p_1,\dots,p_n) = (1/n,\dots,1/n)$, and then the equality
of the imposed source frequencies $f_1,\dots,f_n$ implies that MLD and NCWD
coincide, by Theorem 4.5.5. Clearly, for each $a_j \in A$, the unique word over $A$
of length 1 closest to $a_j$ is $a_j$ itself. That is, $N_j = \{a_j\}$. Thus

$$R = \sum_{j=1}^{n} \frac{1}{n}\, P(a_j \text{ is received} \mid a_j \text{ is sent}) = \frac{1}{n} \sum_{j=1}^{n} q_{jj} = q,$$

the constant main diagonal entry in the matrix of transition probabilities.

Consequently, the definition of channel reliability given here agrees with
the prior definition of the reliability of an n-ary symmetric channel, at least
when that reliability is greater than $1/n$.

When the optimal input frequencies are difficult to obtain, and the channel
is “close” to being n-ary symmetric (see the discussion in section 3.4), a rough
estimate of the reliability of the channel may be obtained by taking the source
frequencies to be equal (to $1/n$). For instance, consider the channel described
in Example 4.5.2. With respect to the encoding scheme $0 \to 0$, $1 \to 1$, we
have $N_0 = \{0, *\}$ and $N_1 = \{1\}$ (since $P(0,*) = .5(.04) > P(1,*) = .5(.03)$),
whence $R \approx .5[.9 + .04] + .5[.92] = .93$.
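
The rough estimate just described reduces to a short computation: with equal input frequencies $1/n$ and the identity scheme, MLD assigns each output letter to the input letter with the largest transition probability into it, so the estimate is $1/n$ times the sum of the column maxima of $Q$. The Python sketch below does this for the channel of Example 4.5.2; the function name `estimated_channel_reliability` is an illustrative assumption, and output letters whose column maximum is not unique are treated as "do not decode" and contribute nothing.

```python
from fractions import Fraction as F

# Transition probabilities of the channel in Example 4.5.2.
Q = {
    "0": {"0": F(90, 100), "1": F(6, 100), "*": F(4, 100)},
    "1": {"0": F(5, 100), "1": F(92, 100), "*": F(3, 100)},
}


def estimated_channel_reliability(q):
    """Reliability under equal input frequencies and the scheme a_j -> a_j."""
    n = len(q)
    outputs = list(next(iter(q.values())))
    total = F(0)
    for b in outputs:
        column = [q[a][b] for a in q]
        best = max(column)
        if column.count(best) == 1:   # a tie would mean "do not decode"
            total += best / n
    return total


R_ESTIMATE = estimated_channel_reliability(Q)
print(float(R_ESTIMATE))
```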


The reliability of a discrete memoryless channel, as defined here, appears 
to be a new index of channel quality. Its relation to channel capacity has not 
been worked out, and it is not yet clear what role, if any, reliability will play in
the theory of communication.


Exercises 4.5 


1. In each of the following, you are given a source alphabet $S$, a code (and
input) alphabet $A$, an output alphabet $B$, source frequencies $f_1,\dots,f_m$, an
encoding scheme, and the matrix $Q$ of transition probabilities. In each case,
produce (i) an MLD table, (ii) the reliability $R$, and (iii) the maximum error
probability of the code-and-channel system.


(a) $S = \{a,b,c\}$, $A = \{0,1\}$, $B = \{0,1,*\}$, $f_a = .4$, $f_b = .35$, $f_c = .25$,
$a \to 00$, $b \to 11$, $c \to 01$, and

$$Q = \begin{bmatrix} q_{00} & q_{01} & q_{0*} \\ q_{10} & q_{11} & q_{1*} \end{bmatrix} = \begin{bmatrix} .8 & .15 & .05 \\ .1 & .86 & .04 \end{bmatrix}.$$

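(b) $S = \{a,b,c,d,e\}$, $A = B = \{0,1,*\}$, $f_a = .25$, $f_b = .15$, $f_c = .05$,
$f_d = .15$, $f_e = .4$, $a \to 00$, $b \to 01$, $c \to 0*$, $d \to 10$, $e \to 11$, and

$$Q = \begin{bmatrix} q_{00} & q_{01} & q_{0*} \\ q_{10} & q_{11} & q_{1*} \\ q_{*0} & q_{*1} & q_{**} \end{bmatrix} = \begin{bmatrix} .95 & .03 & .02 \\ .04 & .92 & .04 \\ \cdots & \cdots & \cdots \end{bmatrix}.$$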


(c) $S = \{a,b,c,d,e\}$, $A = B = \{0,1\}$, the source frequencies are as in
(b), the encoding scheme is $a \to 000$, $b \to 001$, $c \to 010$, $d \to 011$,
$e \to 100$, and the channel is binary symmetric with reliability .9.


2. Suppose the available channel is binary symmetric with reliability .8.
Suppose $S = \{a,b\}$, $f_a = .999$, $f_b = .001$, and the encoding scheme is
$a \to 000$, $b \to 111$.


(a) Verify that MLD will decode every word $w \in \{0,1\}^3$ as ‘a’.

(b) Calculate the reliability of this code-and-channel system and the maximum
error probability.

(c) Same question as (b), but use NCWD.

(d) How large must $t$ be so that, if we consider the encoding scheme $a \to 0^t = 0 \cdots 0$
($t$ zeroes) and $b \to 1^t$, then MLD will decode $1^t$ as $b$?

(e) Find the reliability and the maximum error probability of the code-and-channel
system obtained by taking the scheme you found in part
(d), when the decoding method is MLD and again when it is NCWD.


© 2003 by CRC Press LLC


3. For each instance of $\delta$ and $\varphi$ as in 4.5.4, and each code-and-channel system
with $A = B$ and a fixed-length encoding scheme $s_j \to w_j$, $j = 1,\dots,m$,
of length $\ell$, we can define a nearest-code-word sort of decoding associated
with $\delta$ and $\varphi$, to be denoted NCWD($\delta$, $\varphi$), as NCWD was defined, but with
the metric $d_{gH}$ arising from $\delta$ and $\varphi$ playing the role that $d_H$ plays in the
definition of NCWD. That is, having received $w \in A^\ell$, we decode $w$ as that
$s_j$ for which $d_{gH}(w, w_j)$ is the least, provided there is a unique such $j$. If
there is no unique such $j$, we report “do not decode.”


One reason for considering the metrics $d_{gH}$ is the possibility that, given
a code-and-channel system, there may be a choice of $\delta$ and $\varphi$ such that
NCWD($\delta$, $\varphi$) and MLD coincide for that system. The requirements on the
system for the existence of such a pair $(\delta, \varphi)$ await disclosure, but we can
see readily that there are cases when there is no such pair.


(a) Show that, with $(\delta, \varphi)$ and NCWD($\delta$, $\varphi$) as above, if the code words
$w_j$, $j = 1,\dots,m$, are distinct, then NCWD($\delta$, $\varphi$) will decode $w_j$ as $s_j$.

(b) Conclude that there is no pair $(\delta, \varphi)$ for which NCWD($\delta$, $\varphi$) and MLD
coincide on the code-and-channel system of Exercise 2(a), above.


4. Estimate the reliability of each of the channels mentioned in exercise 1, 
above, by imposing equal source frequencies on the input characters. (Of 
course, in (c) the result will be exact.) 


5. The average error probability can be thought of as the average cost of an 
attempted transmission of a source letter (from the choosing of the source 
letter to the result after decoding), where the cost of an error is one unit, 
and the cost of no error is zero. 


It follows that we can refine the average error probability as a measure 
of system failure by distinguishing more finely among the outcomes of 
the “choose $s_j$ $\to$ transmit $w_j$ $\to$ decode” experiment, assigning different
appropriate costs to these outcomes, and then calculating average cost. This 
type of refinement was alluded to in 4.5.8. 


For example, in the circumstances of Exercise 2 above, it may be that 
source message b is extremely grave and that it would be very costly to 
mistake a for b. Let us suppose that, whatever the decoding method, de- 
coding b when a was intended costs 1000 units, decoding a when b was 
intended costs 100 units, getting “do not decode” when a was intended 
costs one unit, and “do not decode” when b was intended costs 50 units. A 
correct transmission costs nothing. Find the average cost of an attempted 
transmission when 


(a) the decoding method is MLD and the scheme is as in Exercise 2(a), 
above; 


(b) the decoding method is NCWD and the scheme is as in 2(a); 

(c) the decoding method is MLD and the scheme is the one you found in 
2(d); 

(d) the decoding method is NCWD and the scheme is that of 2(d).


6. In each part of Exercise 1, compute the reliability, the average error probability,
and the maximum error probability of the system in which decoding
is by NCWD and in which 


(i) “do not decode” counts as an error; 
(ii) “do not decode” does not count as an error; 
(iii) “do not decode” counts as one-half of an error. 


In each case, assume that the relative source frequencies are known, and 
are as given. 


7. The channel is binary and symmetric, with reliability $p$, $1/2 < p < 1$.
There are two source messages, $a$ and $b$, with equal frequency. For a positive
integer $t$, consider the encoding scheme $a \to 0^t$, $b \to 1^t$. Let $R(t)$
and $E(t) = 1 - R(t)$ denote the reliability and average error probability,
respectively, of this code-and-channel system, using MLD (= NCWD, in
this case); note that $E(t) = \hat{E}(t)$ by the symmetry of the situation. Let
$R_0(t)$ and $E_0(t)$ denote the corresponding probabilities if “do not decode”
is considered a success, not a failure.


(a) Express $R(t)$ and $R_0(t)$ explicitly as functions of $p$ and $t$.

(b) Show that $R(t+1) = R_0(t) - \binom{t}{\lceil t/2 \rceil} p^{\lceil t/2 \rceil} (1-p)^{\lfloor t/2 \rfloor + 1}$ for each positive
integer $t$.

(c) Show that $R(t) < R(t+2)$ and that $R_0(t) < R_0(t+2)$ for each positive
integer $t$.

(d) Show that $R(t) \to 1$ as $t \to \infty$.


[Hints: consider the cases where $t$ is odd or even separately, for (a), (b),
and (c). For (d), use the fact that $p > 1/2$, and the Law of Large Numbers;
see section 1.9.]


8. Suppose that $|S| = m = 35$, and we have a shortest possible fixed-length
encoding scheme for a uniquely decodable code. How long will the MLD
table be when


(a) $|A| = |B| = 2$;
(b) $|A| = 2$, $|B| = 3$;
(c) $|A| = |B| = 3$?


9. In Exercises 1 (b) and (c), above, note that the encoding schemes are as 
short as possible, but not thoughtfully conceived. For instance, in 1(b), 
the code word for e, the most commonly encountered source letter, is a 
Hamming distance 2 from the word for c, the least common code word, but 
a distance 1 from each of the words for b and d. Surely it would increase
the reliability R if we interchange the code words representing c and d, or
c and b.


Verify that this is so. Also, find a fixed-length scheme, of length 3, to
replace the scheme in 1(c), which increases the reliability.


4.6 Shannon’s Noisy Channel Theorem


The theorem referred to describes a beautiful relationship between the compet- 
ing goals of (a) transmitting information as rapidly as possible, and (b) making 
the average error probability as small as possible, given a source and a channel. 

For us, a “source” consists of a source alphabet and a probability distribu- 
tion over the source alphabet, the relative source frequencies. In the “model” 
that we have been using, the source letters are emitted, randomly and indepen- 
dently, with the given relative frequencies (which may be arrived at by observ- 
ing the source for a long time). Shannon’s Noisy Channel theorem applies to a 
more general sort of source, one which emits source letters, but not necessarily 
randomly and independently. We will have a more thorough discussion of these 
sources at the end of Chapter 7. 

It is curious that the Noisy Channel Theorem is widely regarded as the 
centerpiece of information theory, yet both the statement and the proof of the 
theorem are largely useless for practical purposes. Nor can it be said that the 
theorem has worked indirectly as an inspiration in the actual devising of efficient 
“error-correcting” codes, nor in the theory of such things, which could be, and 
usually is, laid out without a single occurrence of the word entropy. Yes, the 
greats of coding theory were aware of Shannon’s theory and the Noisy Channel 
Theorem, but so are professors of accounting or finance aware of the Unique 
Factorization Theorem for the positive integers. 

The rightful acclaim that the Noisy Channel Theorem enjoys arises, we 
think, from its beauty. Shannon’s definitions of information and entropy were 
audacious and not immediately convincing. Of course, a definition is not usually 
required to be convincing, but when you attempt a definition of a word that 
carries a prior connotation, as do information and entropy, the definition should 
have implications and overtones that agree with the prior connotation. Shannon 
was very aware of this informal requirement; he went so far as to show ([63] and
[65]) that if the entropy of a system is to be the average information contained in 
the system (of events), and if entropy is to satisfy certain plausible axioms, then 
information must be defined as it is. (As we saw in Section 2.1, a simpler and 
more convincing demonstration of the inevitability of Shannon’s quantification 
of information was later discovered by Aczél and Daroczy [1].) 

Still, the newcomer to the theory might be forgiven a bit of queasiness 
as conditional entropy and mutual information are added to the list of funda- 
mentals, with channel capacity coming along as a corollary. As touched on 
in Section 3.4, Shannon’s interpretation of channel capacity as measuring the 
maximum possible rate of information flow through a channel, which is what 
channel capacity sounds like it ought to measure, was supported by examples 
and extremal considerations—but is that enough? 

The beauty of the Noisy Channel Theorem lies at least partly in the validation
it provides of the interpretation of maximum mutual information between
inputs and outputs as measuring maximum possible information flow. In brief,
the main statement of the theorem says that if the rate of information flow from 
the source is less than the channel capacity, then you have enough room to afford 
the luxury of error correction—you can make the maximum error probability as 
small as you please, if you are willing to take the trouble, while accommodating 
the information flow from the source with no accumulated delays or backlog. 
Does that not make the channel capacity sound like a maximum possible rate 
of information flow? Another similarly telling assertion in Shannon’s original 
formulation of the theorem is that if the rate of information flow from the source 
is greater than the channel capacity, then the average error probability cannot be 
reduced below a certain positive amount, a function of the source frequencies 
and time rate and of the channel’s transition probabilities. 

Think of a flash flood bearing down on a culvert. The culvert pipe can 
convey a certain maximum volume of water per unit time. If the rate at which 
the flood is arriving at the pipe entrance is below that maximum rate, then, 
in ideal principle, the water can be directed through the pipe without a drop 
sloshing over the roadway above the culvert, and without a pond of unconveyed 
water building up on the flood side of the culvert. In practice, the directing of 
the flood waters into the pipe, a civil engineering problem, will not be perfect — 
some water will be lost by sloshing. But as long as the flood flow is below the 
theoretical maximum that the pipe can handle (the pipe’s capacity), steps can be 
taken to reduce the sloshing loss (error) below any required positive threshold, 
while maintaining flow and avoiding backup. If the flood rate exceeds the pipe 
capacity, then no engineering genius will be able to avoid some combination 
of water loss and backup; the flood volume per unit time in excess of the pipe 
capacity has to wind up somewhere other than the pipe. 

These common sense observations regarding floods and culverts serve as a 
good analogy to the conclusions of the NCT, with the source stream playing the 
role of the flood waters, the channel playing the role of the culvert pipe, and the 
coding/decoding method playing the role of the hypothetical engineering mea- 
sures taken to direct the flood water into the pipe, so as to keep slosh tolerable 
while maintaining flow. The beauty and inevitability of the NCT reside in the 
closeness of this analogy, in our opinion. 

The geometry of the ancient Greeks could have been used by ancient Greek 
craftsmen to make measurements and designs, but the historical evidence apparently
indicates that it was not so used. Geometry was propagated in the intellectual
world over two millennia purely because of its beauty; utility was not a
factor. The Noisy Channel Theorem inserts a promise of inevitability, of im- 
mortality, in information theory. It is analogous to the theorem in geometry that 
the sum of the interior angles of a planar triangle is a straight angle. There is 
more to geometry than that, and we hope that there will be more to information 
theory than the Noisy Channel Theorem; nonetheless, the theorem alone, and 
its immediate consequences, are of considerable weight.


Preliminaries Suppose a channel, $(A, B, Q)$, is given, and also a source alphabet
$S = \{s_1, \ldots, s_m\}$ with relative source frequencies $f_1, \ldots, f_m$, all positive. As
usual, $H(S) = \sum_{j=1}^{m} f_j \log (1/f_j)$.

Suppose that the source emits r characters per unit time. Since the average 
information content per character is H (S), it follows that the source is emitting 
information at the rate r H(S) information units per unit time, on average. Let 
$C$ denote the channel capacity. Suppose that the channel transmitter can send
off $\rho$ input letters per unit time. (So, by Shannon’s interpretation of $I(A, B)$,
the channel can convey a maximum of $\rho C$ units of information per unit time,
on average.)


4.6.1 The Noisy Channel Theorem If $rH(S) < \rho C$, then for any $\varepsilon > 0$ it is
possible to encode the source stream and arrange a decoding method so that on
average each source character is represented by no more than $\rho/r$ input characters,
and so that the maximum, over the source characters, of the probability of
an error at an occurrence of the character in the source stream, is less than $\varepsilon$.

If $rH(S) > \rho C$, then there is a positive number $\varepsilon_0$ such that no matter
how the source stream is encoded and decoded, if, on average, each source
character is represented by no more than $\rho/r$ input characters, the average error
probability of the code-and-channel system will be greater than or equal to $\varepsilon_0$.
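
To see which half of the theorem applies in a concrete case, one only has to compare $rH(S)$ with $\rho C$. The following is a minimal sketch, assuming a binary symmetric channel (whose capacity is $1 + p\log_2 p + (1-p)\log_2(1-p)$ bits per input letter) and information measured in bits. The function names and the numbers at the bottom are our own, chosen only for illustration.

```python
import math

def entropy(freqs):
    """H(S) = sum of f_j * log2(1/f_j), in bits per source character."""
    return sum(f * math.log2(1.0 / f) for f in freqs if f > 0)

def bsc_capacity(p):
    """Capacity, in bits per input letter, of a binary symmetric channel
    with reliability p."""
    return 1.0 + sum(x * math.log2(x) for x in (p, 1.0 - p) if x > 0)

def nct_regime(freqs, r, rho, capacity):
    """Compare the source rate r*H(S) with the channel rate rho*C."""
    source_rate = r * entropy(freqs)
    channel_rate = rho * capacity
    if source_rate < channel_rate:
        return "below"   # first assertion of the NCT applies
    if source_rate > channel_rate:
        return "above"   # second assertion of the NCT applies
    return "equal"       # the theorem as stated says nothing here

# Hypothetical values: three source letters, a channel of reliability 0.9
# sending rho = 100 input letters per unit time.
freqs = [0.5, 0.25, 0.25]
C = bsc_capacity(0.9)
print(entropy(freqs), C)
print(nct_regime(freqs, r=30, rho=100, capacity=C))
print(nct_regime(freqs, r=40, rho=100, capacity=C))
```

The sketch decides only which regime we are in; it says nothing about how to find the code whose existence the theorem asserts.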


We will give a synopsis of the proof shortly, but first a few comments are in 
order. Since input letters are transmitted at the rate of $\rho$ per unit time and source
letters appear at the rate of $r$ per unit time, clearly the requirement that the
source stream be encoded so that the average number of input letters per source
letter is not greater than $\rho/r$ is meant to insinuate that the flow of information
from the source, through the channel to the receiver, and then to the decoder, 
proceeds smoothly, with no backlog of unprocessed source letters. It can be, 
and often is, objected that this view of things leaves out of account the time 
spent encoding and decoding. However, this objection is not entirely fair. 

In the proof of the NCT, the encoding is to be by a fixed length scheme 
applied, not to $S$, but to $S^N$, for some large integer $N$. The length of the
fixed-length scheme is to be approximately $\rho N/r$. Another objection to this
procedure is that the encoder has to wait N/r time units for a sequence of N 
source letters to accumulate. But the two objections, about time spent encod- 
ing and decoding on the one hand, and time spent waiting for N source letters 
to accumulate on the other, cancel each other out, if we can be quick enough 
in encoding and decoding—and by “quick enough” we do not mean instanta- 
neous, we mean taking no longer than N/r time units to encode a source word 
of length $N$ and to decode an output word, at the receiver, of length $\rho N/r$. If
we can encode and decode that quickly, then we can spend the time waiting for 
the next source word of length N to accumulate by encoding the most recently 
emerged source word, transmitting the encoded version of the one before that, 
and decoding the one before that. It is true that there will be a hiatus of up to 
N/r time units between reports from the decoder, but the reward for the wait
will be, not a lone source letter, but a great hulking source word of length N.
So the source stream lurches rather than flows, but the average rate at which 
source letters emerge from the decoder will be the same as the rate at which 
they entered the encoder: r per unit time. 

If you followed this discussion, you might then look back at the statement 
of the NCT and wonder what that business is about the maximum error probability
over the source characters. If we are going to encode $S^N$, shouldn’t we
think about the maximum error probability over $S^N$? Well, no; and, in fact,
that maximum error probability will not be made small. What is meant by the 
probability of an error at an occurrence of a source letter s is the probability 
that, when s occurs in the source stream, the place in the stream emerging from 
the decoder that was occupied by s originally, is occupied by something other 
than s. If we are encoding source words of length N, such an error can oc- 
cur only if s occurred in some source block of length N that got misconstrued 
by the code-channel-decode system; that is, every such error is part of a larger 
catastrophe. 

In asserting that the maximum error probability over the single source let- 
ters can be made as small as desired, with a code that keeps up the rate of 
information flow when $rH(S) < \rho C$, we are departing from the usual statement
of this part of the NCT. The usual statement these days (see [4] and [81])
is mathematically stronger, or no weaker, than our statement, but suffers from 
opacity. We think it is wise to sacrifice strength for friendliness in a theorem 
that is not really used for anything. We will mention the usual conclusion in the 
proof synopsis, below. 

In the last assertion of the NCT, we also depart from Shannon’s version, 
again for esthetic reasons. We will indicate how Shannon put it below. 


Synopsis of the proof of the NCT As mentioned above, the idea is to encode
$S^N$, with $N$ a large integer, by a fixed-length scheme of length $\ell = \lfloor \rho N/r \rfloor$.

By the Law of Large Numbers, if $N$ is large, the source words of length $N$
in which the proportions of the source letters within the word differ markedly
from the $f_j$ have very small probability, collectively. [For instance, if $S =
\{a, b, c\}$, $f_c = .2$, and $N = 10{,}000$, the probability that a source word of length
10,000 will contain fewer than 1,000 c’s, or more than 3,000 c’s, is minuscule.]
We take $N$ large and divide $S^N$ into two sets of words, $L$ (for likely), in the
words of which the source letters occur in proportions quite close to the $f_j$, and
$U$ (for unlikely), $U = S^N \setminus L$. How large $N$ is, and how close those proportions
are to the $f_j$, depend on $\varepsilon$, $r$, $H(S)$, $\rho$, and $C$. In any case, $P(U) = \sum_{w \in U} P(w)$
is very small.
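
How minuscule is the probability in the bracketed example? The number of c’s in a source word of length 10,000 is a binomial count, so the probability can be summed directly. The sketch below is our own illustration (the function names are hypothetical); it works with logarithms of the binomial terms, since the individual factors are far too large or too small to multiply in floating point.

```python
import math

def log_binom_pmf(n, k, f):
    """Natural log of P(exactly k occurrences in n independent trials),
    each trial producing the letter with probability f, 0 < f < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(f) + (n - k) * math.log(1.0 - f))

def prob_count_outside(n, f, low, high):
    """P(count < low or count > high) for a Binomial(n, f) count."""
    ks = list(range(0, low)) + list(range(high + 1, n + 1))
    # Terms too small for floating point underflow harmlessly to 0.0.
    return sum(math.exp(log_binom_pmf(n, k, f)) for k in ks)

# The example from the text: f_c = .2, N = 10,000, fewer than 1,000
# or more than 3,000 occurrences of c.
print(prob_count_outside(10_000, 0.2, 1_000, 3_000))
```

The printed value is far below $10^{-100}$, which gives some feeling for how little is lost by ignoring the words in $U$.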

Now the idea will be to assign to each word in $L$ a code word $w \in A^\ell$,
$\ell = \lfloor \rho N/r \rfloor$. As for the words in $U$: ignore them! If you must encode them,
assign them any which way to the code words for L. This means that when a 
word in U actually occurs, after encoding and transmission the word received is 
almost certain to be misdecoded—but the likelihood of a word in U emerging 
from the source is, by arrangement, so small that this certainty of error in these
cases will have very little effect on the probabilities of errors at occurrences of
the source characters. 

Now, how do we find an encoding scheme for $L \to A^\ell$? This is where we
will be very synoptic; we don’t actually “find” an effective encoding scheme.
There is a probabilistic proof of the existence of $w_1, \ldots, w_{|L|} \in A^\ell$ such that if
the $w_i$ are transmitted and decoded by MLD, assuming they have approximately
equal likelihood $|L|^{-1} \approx \bigl(\prod_{j=1}^{m} f_j^{f_j}\bigr)^N$, then
$\max_{1 \le i \le |L|} P(\text{the received word is misdecoded} \mid w_i \text{ was transmitted}) < \varepsilon/2$,
say. (This is the stronger conclusion
in this part of the NCT alluded to in earlier discussion. Provided $P(U) < \varepsilon/2$,
it certainly implies that the maximum probability of an error at a single source
character is less than $\varepsilon$.)

It is in this probabilistic existence proof that the inequality $rH(S) < \rho C$ enters
strongly; of course, it has already subtly influenced the foresightful choice
of N to be sufficiently large to make everything work. We omit all details of 
this proof—we hope to stimulate the reader’s curiosity. Be warned, however, in 
looking at the proofs in [4], [37], and [81], that the tendency is to first convert 
the source $S$ to a binary source, which murks everything up.

As for the other conclusion of the NCT, when $rH(S) > \rho C$, it can be shown
by general monkeying around as in Shannon’s original proof in [63] that the
conditional entropy or equivocation, $H(S \mid B)$, the measure of uncertainty about
the source, given the output, no matter what the coding method, can be no less
than $(r/\rho)H(S) - C$. Shannon would have left it at that. To get the conclusion
we desire, we need a connection between the equivocation and the average error 
probability. There is one. It is called Fano’s Inequality. See, e.g., [81], p. 43. 

This concludes our synopsis. You can see that the horrible thing about the
proof is that, when $rH(S) < \rho C$, it does not tell you how to encode $S^N$ so that maximum
single-letter error probability is made less than $\varepsilon$. Sharper formulations do
give estimates of N, but the estimates are discouragingly large. Small wonder 
that NCT is solemnly saluted by coding theorists far and wide, and then put in 
a drawer; in the next section we will look at the fields where practical coders 
really play. 


Exercises 4.6 


1. Suppose that $S = \{a, b, c\}$, $f_a = .5$, $f_b = .3$, $f_c = .2$, and the channel is
binary and symmetric with reliability .95. Suppose that the channel can 
transmit 100 bits (binary digits) per second. According to the NCT, what 
is the upper limit on the number of source characters per second that this 
channel can theoretically handle without backlog and with maximum single 
letter error probability as small as desired (but not zero)? 


2. Suppose that the source alphabet S, the relative source frequencies, and the 
channel are as in problem 1. Suppose that $S^2$ is encoded as follows:

    aa → 0000    ba → 0101    ca → 1001
    ab → 1111    bb → 1100    cb → 0110
    ac → 1010    bc → 0011    cc → 1110


Find the maximum single letter error probability and the average single 
letter error probability. [This will involve more than making an MLD table, 
but you may as well make one as an aid. Assume that the source letters are 
emitted randomly and independently, so that the relative frequency of the 
two-letter sequence ac, for instance, would be (.5)(.2) = .1.] 


3. Again, S and the relative source frequencies are as in problem 1, and it is as- 
sumed that the source letters are emitted randomly and independently. Find 
a binary encoding scheme for S* using Huffman’s algorithm, and compute 
the average number of code letters per source letter if this scheme is used 
to encode the source stream. 




4.7 Error correction with binary symmetric channels 
and equal source frequencies 


The case of equal source frequencies and a binary symmetric channel is a very 
important special case because it is the case we think we are in, in a great 
number of real, practical situations in the world today. Or, perhaps we just 
hope and assume that we are in this situation; see 4.5.7 and Section 3.1. 

By Corollary 4.5.6, when the channel is binary and symmetric with reliability
$p > 1/2$ and the source frequencies are equal, MLD and NCWD coincide.
Thus, for each $w \in \{0,1\}^\ell$ received, we decode by examining the words
$w_1, \ldots, w_m \in \{0,1\}^\ell$ in the encoding scheme and picking the one, if any, closer
to $w$ in the Hamming distance sense than are any of the other $w_i$. In this section
we will see a way that this procedure might be simplified, at the cost of some 
reliability. 

As remarked in 4.5.7, the situation described in the title of this section is 
the setting of most of coding theory, which is mainly about binary block codes. 
We shall not go far into that theory, but during our excursion we shall observe 
its customs. For one thing, we shall refer to the set $C = \{w_1, \ldots, w_m\} \subseteq \{0,1\}^\ell$
of code words appearing in the encoding scheme as the code, and all mention
of the source alphabet and of the encoding scheme will be suppressed. This is
not unreasonable in the circumstances, since our decoding method is NCWD;
the only thing we need to know about the source alphabet is its size, $m \le 2^\ell$.


Definitions The operation $+$ is defined on $\{0,1\}$ by $0+0=0$, $0+1=1+0=1$,
and $1+1=0$. The operation $+$ is then defined on $\{0,1\}^\ell$ coordinatewise, given
the definition above. [For example, with $\ell = 5$, $01101 + 11110 = 10011$.]

The Hamming weight of a word $w \in \{0,1\}^\ell$ is $\mathrm{wt}(w)$ = number of ones
appearing in $w$. [For example, $\mathrm{wt}(10110) = 3$.]

If $w_1, \ldots, w_m \in \{0,1\}^\ell$ are distinct words, the distance of the code $C = \{w_1,
\ldots, w_m\}$ is

$$d(C) = \min_{1 \le i < j \le m} d_H(w_i, w_j) = \min_{\substack{w, v \in C \\ w \ne v}} d_H(w, v).$$


4.7.1 For $u, v \in \{0,1\}^\ell$, note that $u + v$ has ones precisely where $u$ and $v$ differ.
Thus $d_H(u, v) = \mathrm{wt}(u + v)$.
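
These definitions translate directly into a few lines of code. The following sketch is our own illustration, with binary words represented as strings of the characters 0 and 1 (a choice of convenience, not anything the theory requires); the function names are hypothetical.

```python
from itertools import combinations

def add(u, v):
    """Coordinatewise sum (mod 2) of two binary words of equal length."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return "".join("1" if a != b else "0" for a, b in zip(u, v))

def wt(w):
    """Hamming weight: the number of ones appearing in w."""
    return w.count("1")

def d_H(u, v):
    """Hamming distance, computed as wt(u + v) in accordance with 4.7.1."""
    return wt(add(u, v))

def code_distance(code):
    """d(C): the minimum Hamming distance between distinct code words."""
    return min(d_H(w, v) for w, v in combinations(code, 2))

print(add("01101", "11110"))   # the example from the Definitions
print(wt("10110"))
print(code_distance(["00000", "11100", "01111"]))
```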


4.7.2 Verify that $\mathrm{wt}(u + v) \le \mathrm{wt}(u) + \mathrm{wt}(v)$, for all $u, v \in \{0,1\}^\ell$.


Definition We will say that a code $C \subseteq \{0,1\}^\ell$ corrects the error pattern $u \in
\{0,1\}^\ell$ if and only if, for each $w \in C$, NCWD will decode $w + u$ as $w$ (or, as
whatever source letter $w$ represents).


The $u$ appearing in the last definition above could be any binary word of
length $\ell$. When we call $u$ an error pattern we are thinking that, during the
transmission of a binary word of length $\ell$ through the channel, errors occurred
at precisely those places in the word marked by 1’s in $u$. Thus, by the definition
of $+$, if $w$ was transmitted and the error pattern $u$ occurred, the word received
at the receiving end of the channel would be $w + u$. Thus the definition above
says that $C$ corrects $u$ if and only if, whenever the error pattern $u$ occurs and
the code $C$ is in use, NCWD (= MLD) will correctly decode the received word,
whichever $w \in C$ was sent.


4.7.3 Example Let C = {00000, 11100, 01111}. Verify that C corrects: 00000, 
10000, 01000, 00100, 00010, 00001, and no other error patterns. Note that if 
11100 is transmitted and the error pattern 01010 occurs, then 10110 will be 
received, which is closer to 11100 than to either of the other two code words; 
but if 00000 or 01111 is transmitted and that error pattern occurs, NCWD will 
decide not to decode. Thus that error pattern is not corrected by the code. 
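
For a code this small, the verification asked for in the example can be done by brute force over all $2^5$ error patterns. The sketch below is our own illustration with hypothetical function names; it treats a tie for nearest code word as “do not decode,” in accordance with the description of NCWD at the start of this section.

```python
from itertools import product

def add(u, v):
    """Coordinatewise sum (mod 2) of two binary words given as strings."""
    return "".join("1" if a != b else "0" for a, b in zip(u, v))

def d_H(u, v):
    """Hamming distance between two binary words of equal length."""
    return add(u, v).count("1")

def ncwd(code, received):
    """Nearest code word decoding: return the unique nearest code word,
    or None ("do not decode") when the nearest is not unique."""
    dists = sorted((d_H(v, received), v) for v in code)
    if len(dists) > 1 and dists[0][0] == dists[1][0]:
        return None
    return dists[0][1]

def corrected_patterns(code):
    """All error patterns u such that w + u is decoded as w for every w in C."""
    length = len(code[0])
    patterns = ("".join(bits) for bits in product("01", repeat=length))
    return [u for u in patterns
            if all(ncwd(code, add(w, u)) == w for w in code)]

C = ["00000", "11100", "01111"]
print(corrected_patterns(C))
print(ncwd(C, add("11100", "01010")))   # decoded correctly
print(ncwd(C, add("00000", "01010")))   # None: do not decode
```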


4.7.4 Let $C = \{0^6, 0^3 1^3, 1^3 0^3, 1^6\}$. Verify that the set of error patterns corrected
by $C$ is $\{0^6\}$ ∪ {all 6 binary words of length 6, of Hamming weight 1} ∪
{100100, 100010, 100001, 010100, 010010, 010001, 001100, 001010, 001001}.


4.7.5 Theorem Suppose $C \subseteq \{0,1\}^\ell$ and $|C| \ge 2$. Then $C$ corrects all error
patterns of length $\ell$, of Hamming weight $\le (d(C) - 1)/2$.


Proof: Suppose $u \in \{0,1\}^\ell$ and $\mathrm{wt}(u) \le (d(C) - 1)/2$. Suppose that $w, v \in C$
and $w \ne v$. Then

$$\begin{aligned}
d(C) \le d_H(w, v) &\le d_H(w, w + u) + d_H(w + u, v) \\
&= \mathrm{wt}(w + (w + u)) + d_H(w + u, v) \\
&= \mathrm{wt}(u) + d_H(w + u, v),
\end{aligned}$$

which implies

$$\begin{aligned}
d_H(w + u, v) &\ge d(C) - \mathrm{wt}(u) \ge d(C) - \frac{d(C) - 1}{2} \\
&= \frac{d(C) + 1}{2} > \frac{d(C) - 1}{2} \ge \mathrm{wt}(u) = d_H(w, w + u).
\end{aligned}$$

Thus, for each $w \in C$, $w$ is the unique word in $C$ closest to $w + u$, so NCWD
will decode $w + u$ as $w$. □


4.7.6 Corollary Let $t = \lfloor (d(C) - 1)/2 \rfloor$. Then, with $E$ denoting the maximum error
probability with $C$ in use, with a binary symmetric channel with reliability $p >
1/2$,

$$E \le 1 - \sum_{j=0}^{t} \binom{\ell}{j} p^{\ell - j} (1 - p)^j.$$


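Proof: Let $U = \{u \in \{0,1\}^\ell ;\ \mathrm{wt}(u) \le t\}$, and, for each $v \in C$, $N_v = \{w \in \{0,1\}^\ell$;
NCWD decodes $w$ as $v\}$. By the theorem,

$$v + U = \{v + u ;\ u \in U\} \subseteq N_v, \quad \text{for each } v \in C.$$

Therefore, for each $v \in C$, with $w$ denoting “the received word,”

$$\begin{aligned}
P(w \in N_v \mid v \text{ is sent}) &\ge P(w \in v + U \mid v \text{ is sent}) \\
&= P(\text{the error pattern } u \text{ lies in } U \mid v \text{ is sent}) \\
&= P(u \in U) \\
&= P\bigl(t \text{ or fewer errors occurred, in } \ell \text{ trials, with probability } 1 - p \text{ of error on each trial}\bigr) \\
&= \sum_{j=0}^{t} \binom{\ell}{j} p^{\ell - j} (1 - p)^j, \quad \text{by Theorem 1.5.7.}
\end{aligned}$$

Since $v \in C$ is arbitrary, the desired conclusion follows. □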
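
The bound of the corollary depends on the code only through its length $\ell$ and its distance $d(C)$, so it is easily evaluated. A minimal sketch, with function and parameter names of our own choosing:

```python
from math import comb

def error_bound(length, d, p):
    """Upper bound of Corollary 4.7.6 on the maximum error probability:
    1 - sum_{j=0}^{t} C(length, j) * p^(length-j) * (1-p)^j,
    where t = floor((d - 1) / 2) and p is the channel reliability."""
    t = (d - 1) // 2
    return 1.0 - sum(comb(length, j) * p ** (length - j) * (1 - p) ** j
                     for j in range(t + 1))

# Illustrative values: a code of length 5 and distance 3 (as in
# Example 4.7.3), with a channel of reliability .95.
print(error_bound(5, 3, 0.95))
```

Note that this is only an upper bound. A code may correct error patterns of weight greater than $t$ under NCWD, as Example 4.7.4 shows, in which case the true maximum error probability is smaller.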


Definition Suppose $C \subseteq \{0,1\}^\ell$ and $|C| \ge 2$. Let $d = d(C)$. In simplified
nearest code word decoding (SNCWD), a received word $w \in \{0,1\}^\ell$ is decoded
as $v \in C$ if and only if $d_H(v, w) \le \frac{d - 1}{2}$. If there is no such $v \in C$, do not decode
$w$.


By the proof of Theorem 4.7.5, for each $w \in \{0,1\}^\ell$ there is at most one
$v \in C$ such that $d_H(v, w) = \mathrm{wt}(v + w) \le (d(C) - 1)/2$. Observe that if $u =
v + w$ then $v = w + u$, because of the peculiar definition of $+$. Consequently,
the carrying out of SNCWD can proceed as follows: given $w$, start calculating
the words $v + w$, $v \in C$, until you run across one of weight $\le (d(C) - 1)/2$. If
you have saved the $v$, report that as the intended code word. Alternatively, $v$
can be recovered from $v + w$ and $w$ by addition. If there is no $v \in C$ for which
$\mathrm{wt}(v + w) \le (d(C) - 1)/2$, report “do not decode.”

Is this procedure any easier than plain old NCWD? From the naive point of
view, no. In both procedures you have to calculate $d_H(v, w)$, $v \in C$, until either
a $v$ is found for which $d_H(v, w) \le (d(C) - 1)/2$, or until all $v \in C$ have been
tried, at which point, with NCWD, the numbers $d_H(v, w)$ must be compared.
You save a little trouble with SNCWD by omitting this last comparison; but 
surely, you might think, this saving would not compensate us sufficiently for 
the loss of reliability incurred by forsaking NCWD for SNCWD. 

But the fact is that SNCWD is very commonly used. The details are beyond 
the scope of this course. Suffice it to say that knowing exactly which error 
patterns will be corrected sometimes leads, in the presence of certain algebraic 
properties of C, to very efficient decoding procedures. 
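
The naive form of SNCWD described above can be sketched as follows. This is our own illustration, not one of the efficient algebraic procedures alluded to; binary words are represented as strings, the distance $d(C)$ is assumed to be known in advance, and the function names are hypothetical.

```python
def add(u, v):
    """Coordinatewise sum (mod 2) of two binary words given as strings."""
    return "".join("1" if a != b else "0" for a, b in zip(u, v))

def wt(w):
    """Hamming weight of a binary word."""
    return w.count("1")

def sncwd(code, received, d):
    """Simplified nearest code word decoding.

    Return the code word v with wt(v + received) <= (d - 1)/2, where d is
    the distance of the code, or None ("do not decode") if there is none.
    By the proof of Theorem 4.7.5 there is at most one such v, so the
    search can stop at the first one found."""
    t = (d - 1) // 2   # integer weights: weight <= (d-1)/2 iff weight <= t
    for v in code:
        if wt(add(v, received)) <= t:
            return v
    return None

C = ["000000", "000111", "111000", "111111"]   # the code of 4.7.4, d(C) = 3
print(sncwd(C, "000110", 3))   # one error away from 000111
print(sncwd(C, "100100", 3))   # weight-2 pattern on 000000: None
```

The second call illustrates the loss of reliability: by 4.7.4, NCWD corrects the error pattern 100100 for this code, while SNCWD reports “do not decode.”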

If we define “C corrects the error pattern u” for SNCWD as it was defined 
for NCWD, it is easy to see that, with SNCWD, the error patterns corrected 
by $C$ are precisely the words of weight $\le (d(C) - 1)/2$; furthermore, the error
patterns corrected correctly are the same for different code words—see Example 
4.7.3 to see that this is not necessarily the case with NCWD. Note also Example 
4.7.4 and compare the error patterns corrected there by NCWD with the error 
patterns corrected by SNCWD. 


Reliability For $C \subseteq \{0,1\}^\ell$, let $R(C, p)$ denote the reliability of the code-and-channel
system obtained by using $C$, a binary symmetric channel with reliability
$p$, and NCWD. (As elsewhere in this section, the source frequencies are
assumed to be equal.) Let $R_S(C, p)$ denote the reliability when SNCWD is
used. Corollary 4.7.6 implies that

$$R(C, p) \ge \sum_{j=0}^{\lfloor (d(C) - 1)/2 \rfloor} \binom{\ell}{j} p^{\ell - j} (1 - p)^j.$$


4.7.7 Proposition $R_S(C, p) = \sum_{j=0}^{\lfloor (d(C) - 1)/2 \rfloor} \binom{\ell}{j} p^{\ell - j} (1 - p)^j$.


The proof, after that of 4.7.5 and the remarks above, is straightforward. 

$R_0(C, p)$ and $(R_S)_0(C, p)$ will denote, as in Exercise 4.5.7, the relaxed
reliabilities obtained by not considering a “do not decode” message to be an 
error. 


Exercises 4.7 
1. Express explicitly, as formulas in $p$, the reliabilities $R(C, p)$, $R_S(C, p)$,
$R_0(C, p)$, and $(R_S)_0(C, p)$, when $C$ is the code of Example 4.7.3.
2. Same question for Example 4.7.4. 


3. There are two famous binary block codes, of lengths 23 and 24, called the 
Golay code and the extended Golay code, respectively. Let us denote the 
Golay code by C23, and the extended Golay code by C24. Their distances 
are d(C23) = 7 and d(C24) = 8.

Both codes have the remarkable property that NCWD and SNCWD coincide
when these codes are in use. C23 has the further property that there is
never a “do not decode” result. This is not the case with C24, however, and,
in fact, every error pattern of weight 4 occurring when C24 is in use will
result in a “do not decode” message.


(a) Express R(C23, p) and R(C24, p) explicitly as functions of p, assum- 
ing p> 1/2. 

(b) Show that $R(C_{23}, p) > R(C_{24}, p)$ for $1/2 < p < 1$, but that
$R_0(C_{23}, p) = R(C_{23}, p) < R_0(C_{24}, p)$ for $1/2 < p < 1$.

(c) |C23| = |C24|, both codes come equipped with very fast NCWD algo- 
rithms, and the slightly greater length of C24 is a negligible drawback; 
so, what do the results of part (b), above, suggest to you about which 
of the two codes you would choose to use? In some circumstances you 
would take C23 over C24, and in other circumstances it would be the 
other way around. What sort of consideration decides the choice? Be 
brief. 


4. Suppose $C \subseteq \{0,1\}^\ell$ and the channel is binary and symmetric with reliability $p$.
Show that $t = \lfloor (d(C) - 1)/2 \rfloor$ is the largest integer among the integers $i$ with the
property that $C$ will correct (using NCWD) all error patterns of weight $\le i$.




4.8 The information rate of a code 


For a binary code $C \subseteq \{0,1\}^\ell$, the information rate of $C$ is generally defined
to be $(\log_2 |C|)/\ell$. To see what this really means, and how to generalize it
to variable-length encoding schemes over possibly non-binary code alphabets,
suppose that $C$ is used to encode a source alphabet $S$ with $m = |C|$ letters, of
equal relative frequencies. Then $H(S) = \log_2 m = \log_2 |C|$; this is the number
of bits of information carried by each code word. Since each code word is $\ell$
bits long, the number of bits per bit, so to speak, carried by the code words is
$(\log_2 |C|)/\ell$. To put it another way, $(\log_2 |C|)/\ell$ is the rate at which the code
words are carrying information, in bits per input (code) letter.

The preceding account rests on the assumption that the relative source frequencies
are equal. In the more general situation when we have a source alphabet
$S$, $|S| = m$, with possibly unequal relative frequencies, and a uniquely
decodable scheme for $S \to A$, where $A$ is a code alphabet with $|A| = n \ge 2$, the
discussion above is adaptable to give the result that the information rate of the
code, interpreted as the average amount of (source) information carried by the
code words, per code letter, is $H(S)/\bar{\ell}$, where $\bar{\ell}$ is the average code word length,
computed from the scheme and the relative source frequencies. The units of information
are determined by the choice of the base of the log appearing in the
computation of $H(S)$. To accord with the binary case, we may as well adopt the
convention that that base is to be $n = |A|$.

If $A$ is the input alphabet of a channel, at what rate, then, is information
originating from the source appearing at the receiver of the channel, per output
letter, given a uniquely decodable scheme for $S \to A$? Since there is one
output letter per input letter, and $I(A, B)$ gives the rate of flow of information
through the channel, i.e., the average number of units of information arriving
at the receiver for each unit of information transmitted, it follows that the average
amount of information from the source to the receiver, per output letter, is
$(H(S)/\bar{\ell})\,\bigl(I(A, B)|_{p_1, \ldots, p_n}\bigr)$, with $\log = \log_n$ and $p_1, \ldots, p_n$ computed from the
encoding scheme and the relative source frequencies, as in section 4.4.

Both $H(S)/\bar{\ell}$ and this quantity multiplied by $I(A, B)|_{p_1, \ldots, p_n}$ are indicators
of an encoding scheme’s efficacy, but let us be under no illusions as to
the sensitivity of these indices. Notice that $H(S)/\bar{\ell}$ is increased only by decreasing
$\bar{\ell}$, within the requirement of unique decodability; clearly different
uniquely decodable encoding schemes with the same $\bar{\ell}$ can have very different
qualities. In particular, in the case of fixed-length encoding schemes, this index
leaves error-correction facility and encoding/decoding speed and efficiency
completely out of account. The index $(H(S)/\bar{\ell})\,\bigl(I(A, B)|_{p_1, \ldots, p_n}\bigr)$ is somewhat
more interesting: it goes up as $\bar{\ell}$ decreases and/or as the $p_1, \ldots, p_n$ resulting
from the scheme better approximate the optimal input frequencies of the channel.
But it still leaves out of account code qualities of practical interest. (See
Exercise 4.8.3.)

This does not mean that these indices are useless! Consider that knowing 
the area of a planar figure tells you nothing about the shape or other geometric 
and topological properties of the figure. Does that mean that we should give up 
on the parameter we call “area?” Just so, the two parameters we are discussing 
here will have their uses in the discussion and comparison of code-and-channel 
systems. We need to be aware of the limitations of these discussions and comparisons,
but if we are aware, then let’s proceed! From here on, we will refer
to $H(S)/\bar{\ell}$ as the (pretransmission) information rate of the code involved, and
$(H(S)/\bar{\ell})\,\bigl(I(A, B)|_{p_1, \ldots, p_n}\bigr)$ as the (post-transmission) information rate of the
code-and-channel system.

Given $S$, $f_1, \ldots, f_m$, $A$, $B$, $Q$, and a fixed-length uniquely decodable scheme
for $S \to A$, $s_j \to w_j \in A^\ell$, $j = 1, \ldots, m$, as in Section 4.5, there is another parameter
associable to the code-and-channel system that might be preferable to
$(H(S)/\ell)(I(A, B))$ as a measure of information flow from the source to the
channel receiver, per output letter: $I(S, B^\ell)/\ell$. Here $S$ and $B^\ell$ stand for the
obvious systems of events associable to the multistage probabilistic experiment
described in section 4.5. The mutual information $I(S, B^\ell)$ is divided by $\ell$,
above, to make the result comparable to $(H(S)/\ell)\bigl(I(A, B)|_{p_1, \ldots, p_n}\bigr)$ as a measure
of information conveyed per output letter.

Shannon’s interpretation of $I(A, B)$ as measuring the average amount of
information conveyed by the channel (given certain relative input frequencies)
per input letter (see Section 3.4) transfers to an interpretation of $I(S, B^\ell)$, in this
more complicated situation, as the average amount of information conveyed by
the code-and-channel system, per source letter, given $f_1, \ldots, f_m$ and a fixed-length
encoding scheme for $S \to A$. Thus $I(S, B^\ell)/\ell$ would seem to be a measure
of rate of information flow from the source to the channel receiver, per
output letter, that is more reflective of error correction concerns and a generally
more sensitive index of the efficacy of the code-and-channel system than is
$(H(S)/\ell)(I(A, B))$.

However, $I(S, B^\ell)/\ell$ as an index of goodness suffers from a grave defect:
it is frightfully difficult to calculate, even in the simplified circumstances of
binary block codes with equal source frequencies and a binary symmetric channel.
Suppose we are in these circumstances and, in addition, the relative input
frequencies generated by the use of the code are $p_0 = p_1 = 1/2$. (This is a
common circumstance in practice. For instance, 0 and 1 occur equally often
when the Golay codes, mentioned in Exercise 4.7.3, are used with equal source
frequencies. The same holds for any linear block code (see [30] for definitions)
containing the word with all ones, and almost all of the commonly used binary
block codes are of this sort.) Then

$$(H(S)/\ell)\, I(A, B) = \frac{\log_2 |C|}{\ell} \bigl(p \log_2 2p + (1 - p) \log_2 2(1 - p)\bigr),$$

where $C \subseteq \{0,1\}^\ell$ is the code ($|C| = m = |S|$) and $p$ is
the reliability of the channel. That is, the post-transmission information rate is
just the conventional information rate times the channel capacity. Meanwhile,
$I(S, B^\ell)$ is a daunting sum of $m \cdot 2^\ell$ terms (see Exercise 4.8.2). For particular
binary block codes this expression can be greatly simplified, but not enough
to put it in the category of $(\log_2 |C|)/\ell$, $d(C)$, or $\ell$ itself as easily consulted
indices of the quality of a binary block code, in standard circumstances. One
could argue that the difficulty of calculating $I(S, B^\ell)$ is the price you pay for
the subtle power of this index. But an indicator that is harder to calculate than
the items of interest that it might be an indicator of, like reliability and error
probability, is not a useful indicator.
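
By contrast, the two easily consulted rates take only a few lines to compute. The sketch below is our own illustration, under the standing assumptions of this paragraph: a binary block code, equal source frequencies, relative input frequencies $p_0 = p_1 = 1/2$, and a binary symmetric channel with reliability $p$. The function names are hypothetical, and the channel reliability .95 in the example is chosen only for illustration.

```python
import math

def information_rate(code_size, length):
    """(log2 |C|) / l: bits of source information per code letter."""
    return math.log2(code_size) / length

def bsc_capacity(p):
    """p*log2(2p) + (1-p)*log2(2(1-p)), the capacity of a binary symmetric
    channel with reliability p, attained at input frequencies 1/2, 1/2."""
    return sum(x * math.log2(2.0 * x) for x in (p, 1.0 - p) if x > 0)

def post_transmission_rate(code_size, length, p):
    """Conventional information rate times the channel capacity."""
    return information_rate(code_size, length) * bsc_capacity(p)

# The Golay code C23 has length 23 and 2^12 code words.
print(information_rate(2 ** 12, 23))
print(post_transmission_rate(2 ** 12, 23, 0.95))
```

Both numbers depend on the code only through $|C|$ and $\ell$, which is exactly the insensitivity discussed above.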

Still, $I(S, B^\ell)$ is an important and interesting number associated with a
fixed-length code-and-channel system, and an academic study of its behavior
and its relation to other indicators may bring some rewards. Here is a question
for anyone who might be interested: by Corollary 2.4.6, $I(S, B^\ell) \le H(S)$; is it
necessarily the case that $I(S, B^\ell) \le H(S) I(A, B)$? $I(A, B)$ here is, of course,
calculated with the relative input frequencies produced by the encoding scheme 
and the relative source frequencies. 


Exercises 4.8 


1. Calculate $H(S)/\ell$ and $(H(S)/\ell)(I(A, B))$ for each of the fixed-length
code-and-channel systems in Exercise 4.5.1. 


2. Given $S$, $f_1, \ldots, f_m$, $A$, $B$, $Q$, and a fixed-length encoding scheme $s_j \to
w_j = a_{i(1,j)} \cdots a_{i(\ell,j)} \in A^\ell$, verify that

$$I(S, B^\ell) = \sum_{j=1}^{m} f_j \sum_{1 \le t_1, \ldots, t_\ell \le k} \Bigl[\prod_{z=1}^{\ell} q_{i(z,j),\,t_z}\Bigr] \log \frac{\prod_{z=1}^{\ell} q_{i(z,j),\,t_z}}{\sum_{u=1}^{m} f_u \prod_{z=1}^{\ell} q_{i(z,u),\,t_z}}.$$

Write this out as a function of $p$, using $\log = \log_2$, in case $m = 4$, $f_j =
1/4$, $j = 1, \ldots, m$, $\{w_1, w_2, w_3, w_4\} = \{0000, 0011, 1100, 1111\}$, and the
channel is binary symmetric with reliability $p$. Compare $I(S, B^\ell)$ to $H(S)$
and to $H(S) I(A, B)$ in this case.

3. Describe a binary block code C of length 23 with the same information rate 
and post-transmission information rate as the Golay code, C23, mentioned 
in Exercise 4.7.3, with $d(C) = 1$. [You need to know that $|C_{23}| = 2^{12}$ and
that 0 and 1 occur equally often, overall, in the code words of C23. Assume
equal source frequencies.]


Chapter 5


Lossless Data Compression by 
Replacement Schemes 


Most (but not all) modern data compression problems are of the following form: 
you have a long binary word (or “file”) W which you wish to transform into a 
shorter binary word U in such a way that W is recoverable from U, or, in ways to 
be defined case by case, almost or substantially recoverable from U. In case W 
is completely recoverable from U, we say we have lossless compression.
Otherwise, we have lossy compression. The compression ratio is lgth(W)/lgth(U).
The “compression ratio achieved by a method” is the average compression ra- 
tio obtained, using that method, with the average taken over all instances of W 
in the cases where the method is used. (This taking of the average is usually 
hypothetical, not actual.) 

Sometimes the file W is sitting there, available for leisurely perusal and 
sampling. Sometimes the file W is coming at you at thousands of bits per sec- 
ond, with immediate compression required and with no way of foretelling with 
certainty what the bit stream will be like 5 seconds from now. Therefore, our 
compression methods will be distinguished not only by how great a compres- 
sion ratio they achieve, together with how much information they preserve, but 
also by how fast they work, and how they deal with fundamental changes in 
the stream W (such as changing from a stream in which the digits 0, 1 occur 
approximately randomly to one which is mostly 0’s). 

There is another item to keep account of in assessing and distinguishing 
between compression methods: hidden costs. These often occur as instructions 
for recovering W from U. Clearly it is not helpful to achieve great compression, 
if the instructions for recovering W from U take almost as much storage as W 
would. We will see another sort of hidden cost when we come to arithmetic 
coding: the cost of doing floating-point arithmetic with great precision. Clearly 
hidden costs are related to speed, adaptability, compression ratio, and informa- 
tion recovery in a generally inverse way; when you think you have made a great 
improvement in a method, or you think you have a new method with improved 
performance over what went before, do not celebrate until you have looked for 
the hidden costs! 

In the compression method to be described in this chapter, the hidden costs
are usually negligible (an encoding scheme very much smaller than the file has
to be stored), and the method is lossless with very fast decompression
(recovering W from U). On the down side, the method is applicable mainly in the cases
where the file W is lying still and you have all the time in the world, and the 
compression ratios achieved are not spectacular. Roughly, “spectacular” begins 
at 10-to-1, and the compression ratios we will see by this method are nowhere 
near that. 

From now on, the code alphabet will be {0, 1}, unless otherwise specified. 
We will survey extensions of our methods and results to the non-binary cases 
from time to time, but the binary alphabet is king at this point in history, so 
it seems more practical to do everything binary-wise and occasionally mention 
generalizations, rather than to struggle with the general case in everything. 




5.1 Replacement via encoding scheme 


In a nutshell, the game is to choose binary words $s_1, \ldots, s_m$ into which the
original file can be parsed (divided up), and then to replace each occurrence of each
$s_i$ in the parsed file with another binary word $w_i$; the $w_i$ are to be chosen so
that the new file is shorter than the original, but the original is recoverable from
the new. This kind of game is sometimes called zeroth-order replacement. You
will see how “zero” gets into it later, when we consider higher order
replacement. The assignment of the $w_i$ to the $s_i$ is, as in Chapter 4, called an encoding
scheme.
For example, suppose we take

$$
\begin{aligned}
s_1 &= 0\\
s_2 &= 10\\
s_3 &= 110 \qquad (*)\\
s_4 &= 1110\\
s_5 &= 1111,
\end{aligned}
$$


and the original “file” is 111110111111101110111101110110. (Of course this is an
unrealistic example!) The file can be parsed into the string
$s_5 s_2 s_5 s_4 s_4 s_5 s_1 s_4 s_3$.
(Notice that there was no choice in the matter of parsing; the original binary
word is uniquely parsable into a string of the $s_i$. Notice also that we are avoiding
a certain difficulty in this example. If we had one, two, or three more 1’s at
the end of the original file on the right, there is no way that we could have
incorporated them into the source word, the string of $s_i$’s.)
Now, suppose we encode the $s_i$ according to the encoding scheme

$$
\begin{aligned}
s_1 &\to 1111\\
s_2 &\to 1110\\
s_3 &\to 110 \qquad (**)\\
s_4 &\to 10\\
s_5 &\to 0.
\end{aligned}
$$

The resulting new file is 01110010100111110110. Notice that this file is 20 bits
long, while the original is 30 bits long, so we have a compression ratio of 3/2.
(Not bad for an unrealistically small example! But, of course, for any positive
number R it is possible to make up an example like the preceding in which the
compression ratio is greater than R. See Exercise 5.1.3.)

But is the old file recoverable from the new file? Ask a friend to translate
the new file into a string of the symbols $s_i$, according to (**). Be sure to hint
that scanning left to right is a good idea. Your friend should have no trouble in
translating the new file into $s_5 s_2 s_5 s_4 s_4 s_5 s_1 s_4 s_3$, from which you (or your friend!)
can recover the old file by replacing the $s_i$ with binary words according to (*).
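
The round trip just described can be sketched in a few lines of Python. This is only an illustrative sketch, not a production encoder: the names `SOURCE_WORDS`, `CODE_WORDS`, `scan`, and `substitute` are our own hypothetical choices, and the left-to-right scan relies on both (*) and (**) satisfying the prefix condition.

```python
# Illustrative sketch; all names here are hypothetical, chosen for this example.
SOURCE_WORDS = {1: "0", 2: "10", 3: "110", 4: "1110", 5: "1111"}   # (*)
CODE_WORDS = {1: "1111", 2: "1110", 3: "110", 4: "10", 5: "0"}     # (**)


def scan(bits, words):
    """Scan left to right, recording the index of each word as it is recognized.

    Assumes `words` is prefix-free. Returns (indices, unread_tail)."""
    inverse = {w: i for i, w in words.items()}
    indices, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            indices.append(inverse[current])
            current = ""
    return indices, current


def substitute(indices, words):
    """Replace each index by the corresponding binary word and concatenate."""
    return "".join(words[i] for i in indices)


original = "111110111111101110111101110110"
z, tail = scan(original, SOURCE_WORDS)          # parse by (*)
new_file = substitute(z, CODE_WORDS)            # encode by (**)
decoded, _ = scan(new_file, CODE_WORDS)         # decode by (**)
recovered = substitute(decoded, SOURCE_WORDS)   # rebuild by (*)
print(z, new_file, recovered == original)
```

Here `z` holds the subscripts of the source string, and the comparison at the end confirms that the 30-bit original is recovered from the 20-bit new file.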

Those who have perused Chapter 4, or at least the first two sections, will not
be surprised about the first stage of the recovery process, because the encoding
scheme (**) satisfies the prefix condition. You might also notice that the
definition of the $s_i$ in (*), regarded as an encoding scheme, satisfies the prefix
condition as well. This is no accident!

We shall review what we need to know about the prefix condition in the 
next section. For now, we single out a property of $s_1, \ldots, s_5$ in the preceding
example that may not be so obvious, but which played an important role in 
making things “work” in the example. 


Definition A list $s_1, \ldots, s_m$ of binary words has the strong parsing property
(SPP) if and only if every binary word W is uniquely a concatenation,

$$
W = s_{i_1} \cdots s_{i_t} v,
$$

of some of the $s_i$ and a (possibly empty) word $v$, with no $s_i$ as a prefix (see
Section 5.2), satisfying $\mathrm{lgth}(v) < \max_{1 \le i \le m} \mathrm{lgth}(s_i)$.


The word $v$ is called the leave of the parsing of W into a sequence of
the $s_i$. The uniqueness requirement says that if
$W = s_{i_1} \cdots s_{i_t} v = s_{j_1} \cdots s_{j_r} u$,
with neither $u$ nor $v$ having any of the $s_i$ as a prefix and
$\mathrm{lgth}(v), \mathrm{lgth}(u) < \max_i \mathrm{lgth}(s_i)$, then $t = r$, and
$i_1 = j_1, \ldots, i_t = j_t$, and $v = u$.


Notice that in any list with the SPP the $s_i$ must be distinct (why?), and
any rearrangement of the list will have the SPP as well. Therefore, we will 
allow ourselves the convenience of sometimes attributing the SPP to finite sets 
of binary words; such a set has the SPP if and only if any list of its distinct 
elements has the SPP. 

To see that $s_1, \ldots, s_5$ in (*) in the preceding example have the SPP, think
about trying to parse a binary word W into a string of the $s_i$, scanning left to
right. Because of the form of $s_1, \ldots, s_5$, the parsing procedure can be described
thus: scan until you come to the first zero, or until you have scanned four ones.
Jot down which $s_i$ you have identified and resume scanning. Pretty clearly this
procedure will parse any W into a string of the $s_i$ with leave $v = \lambda$, 1, 11, or 111
(with $\lambda$ denoting the empty string). It becomes clear that this parsing is always
possible, and most would agree that the parsing is unique, on the grounds that 
there is never any choice or uncertainty about what happens next during the 
parsing. We will indicate a logically rigorous proof of uniqueness, in a general 
setting, in the exercises at the end of the next section. 
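
The scanning procedure just described is easy to mechanize. The following is a minimal sketch for the particular list (*) only; the function name `parse_with_leave` is our own hypothetical choice, and the leave is returned separately from the string of subscripts.

```python
def parse_with_leave(bits):
    """Parse `bits` by s1=0, s2=10, s3=110, s4=1110, s5=1111.

    Scans until the first zero or until four ones have been read.
    Returns (subscripts, leave), where leave is '', '1', '11', or '111'."""
    subscripts, ones = [], 0
    for b in bits:
        if b == "0":
            # `ones` ones followed by a zero is the word s_(ones+1)
            subscripts.append(ones + 1)
            ones = 0
        else:
            ones += 1
            if ones == 4:          # four ones in a row is s5
                subscripts.append(5)
                ones = 0
    return subscripts, "1" * ones  # unread ones form the leave


print(parse_with_leave("111110111111101110111101110110" + "11"))
```

Appending two extra 1's to the example file leaves the string of subscripts unchanged and produces the leave 11, which is the situation the example in the text deliberately avoided.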

Which sets of binary words have the SPP? We shall answer this question
fully in the next section. But there is a large class of sets with the SPP that ought 
to be kept in mind, not least because these are the source alphabets that are 
commonly used in current data compression programs, not only with the zeroth- 
order replacement strategy under discussion here, but also with all combinations 
of higher order, adaptive, and/or arithmetic methods, to be looked at later. All 
of these methods start by parsing the original file W into a source string, a long 
word over the source alphabet S. The methods differ in what is then done with
the source string. 

The most common sort of choice for source alphabet is: $S = \{0, 1\}^L$, the set
of binary words of some fixed length L. Since computer files are commonly or- 
ganized into bytes, binary words of length 8, the choice L = 8 is very common. 
Also, L = 12, a byte-and-a-half, seems popular. 

If $S = \{0, 1\}^L$, the process of parsing a binary word W into a source string
amounts to chopping W into segments of length L. If L = 8 and the original
file is already organized into bytes, that’s that; the parsing is immediate. There
is another good reason for the choice L = 8. When information is stored byte
by byte, it very often happens that you do not need all 8 bits in the byte to record
the datum, whatever it is. For instance, suppose there are only 55 different basic
data messages to store, presumably in some significant order. You need only 6
bits (for a total of $2^6 = 64$ possibilities) to accommodate the storage task, yet it
is customary to store 1 datum per byte. Thus one can expect a compression ratio
of at least 8/6 = 4/3 in this situation, just by deleting the 2 unused bits per byte.
Thus the historical accident that files are, sometimes inefficiently, organized into
bytes makes the choice L = 8 rather shrewd. The best zeroth-order replacement
method, Huffman encoding, takes advantage of this inefficiency, and more. In
the hypothetical situation mentioned above, we might expect something more
like $8/\log_2 55$ as a compression ratio, using $S = \{0, 1\}^8$ and zeroth-order simple
Huffman encoding. Details to follow!
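
The two figures quoted above are quick to check numerically. The sketch below is only arithmetic; the helper name `bits_per_datum` is hypothetical, and reading $8/\log_2 55$ as the case in which the 55 messages occur about equally often is our interpretation, not a claim made in the text.

```python
import math


def bits_per_datum(n_messages):
    """Smallest b with 2**b >= n_messages, i.e. the fixed label length needed."""
    return max(1, (n_messages - 1).bit_length())


n = 55
b = bits_per_datum(n)                # 2**6 = 64 is the first power of 2 that is >= 55
fixed_ratio = 8 / b                  # ratio from simply dropping the unused bits
entropy_ratio = 8 / math.log2(n)     # the figure 8 / log2(55) quoted in the text
print(b, fixed_ratio, entropy_ratio)
```

The second ratio is slightly larger than 4/3, which is the sense in which Huffman encoding can be expected to do a little better than deleting unused bits.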

Even though $S = \{0, 1\}^L$ is the most common sort of choice of source
alphabet, we do not want to limit our options, so we will continue to allow all S
with the SPP; these S will be completely characterized in the next section.

A problem that may have occurred to the attentive anxious reader is: what
effect should the leave have in the calculation of the compression ratio? In real
life, the original binary word W to be parsed and compressed is quite long;
saving and pointing out the leave may require many more bits than the leave
itself, but the added length to the compressed file will generally be negligible,
compared to the total length of the file. For this reason, we shall ignore the
leave in the calculation of the compression ratio. Therefore, if $S = \{s_1, \ldots, s_m\}$
and a file W parses into $W = s_{i_1} \cdots s_{i_t} v$, $\mathrm{lgth}(v) < \max_i \mathrm{lgth}(s_i)$, and if the
$s_i$ are replaced according to the encoding scheme $s_i \to w_i$, $i = 1, \ldots, m$, the
compression ratio will be
$\sum_{j=1}^{t} \mathrm{lgth}(s_{i_j}) \big/ \sum_{j=1}^{t} \mathrm{lgth}(w_{i_j})$, regardless of $v$, by
convention.


Exercises 5.1


1. Suppose $S = \{0, 1\}^L$ is the source alphabet. What are the possible leaves in
the parsing of files by S? 


2. Show that each of the following does not possess the SPP. 


(a) S = {0,01,011,0111, 1111}. [Hint: let W =01.] 
(b) S = {00, 11,01} 
(c) $S = \{0, 1\}^L \setminus \{w\}$, for any binary word $w$ of length L.


3. Invent a situation, with a source alphabet $S = \{s_1, \ldots, s_m\}$ satisfying the
SPP, an encoding scheme $s_j \to w_j$, $j = 1, \ldots, m$, satisfying the prefix
condition (see the next section), and an original file W, such that the
compression ratio achieved by parsing and then encoding W is at least 5 to 1. [Hint:
you could take the example of this section as a model, with $s_1 = 0$, $s_2 = 10$,
..., with an encoding scheme in the mode of the example, and with a silly
W; what if W has all 1’s?]


4. Find the compression ratio if the original file in the example in this section
is parsed using $S = \{0, 1\}^3$ and encoded using the scheme


000 → 1111111     100 → 1110
001 → 1111110     101 → 110
010 → 111110      110 → 10
011 → 11110       111 → 0




5.2 Review of the prefix condition 


We collect here some of the definitions and results from Chapter 4 and apply 
them to our current purpose, the characterization of lists of binary words with 
the strong parsing property. 

A binary word $u$ is a prefix of a binary word $w$ if and only if $w = uv$ for
some (possibly empty) word $v$. A list $w_1, \ldots, w_m$ of binary words satisfies the
prefix condition if and only if whenever $1 \le i, j \le m$ and $i \ne j$, then $w_i$ is not
a prefix of $w_j$. An encoding scheme $s_i \to w_i$, $i = 1, \ldots, m$, satisfies the prefix
condition if and only if the list $w_1, \ldots, w_m$ does. It is common usage to say that
such a scheme defines a prefix-condition code, or simply a prefix code. Take
note that this terminology is somewhat misleading, because a “prefix code” is
characterized by an absence of prefix relations in its encoding scheme.

The practical importance of prefix codes is encapsulated in Theorem 4.2.1, 
which says: 


5.2.1 Every prefix code is uniquely decodable, with reading-left-to-right
serving as a valid decoder-recognizer.

In case you have not read Section 4.1, here is a translation:


5.2.2 If $s_j \to w_j$, $j = 1, \ldots, m$, is a prefix-condition encoding scheme, and if
for some subscripts $1 \le i_1, \ldots, i_t, j_1, \ldots, j_r \le m$,
$w_{i_1} \cdots w_{i_t} = w_{j_1} \cdots w_{j_r}$, then
$t = r$ and $i_k = j_k$, $k = 1, \ldots, t$. Furthermore, the source word $s_{i_1} \cdots s_{i_t}$ can
infallibly be recovered from the code word $w_{i_1} \cdots w_{i_t}$ by scanning left to right
and noting the source letter $s_{i_k}$ each time a word $w_{i_k}$ from the encoding scheme
is recognized.


(Actually, the assertion that “reading-left-to-right is a valid decoder-recog- 
nizer” for a code given by a particular scheme is a stronger statement than is 
given in the last part of 5.2.2, because of the word “recognizer”—the last as- 
sertion in 5.2.2 says that reading-left-to-right is a valid decoder for a prefix 
code—but we will not tarry further over this point.) 

In the example in the preceding section, the encoding scheme (**) satisfies
the prefix condition, and if you followed the example you experienced directly 
the pleasures of reading-left-to-right with respect to the scheme (**). This sort 
of decoding is also called instantaneous decoding, for reasons that should be 
obvious. 

Pretty clearly, 5.2.2 is telling us that lists of words with the prefix condition 
satisfy something resembling the uniqueness provision of the strong parsing 
property. We will give the full relation between the prefix condition and the 
SPP after restating the Kraft and McMillan Theorems for binary codes. More 
general statements and proofs are given in Section 4.2. 


5.2.3 Kraft’s Theorem for binary codes Suppose $m$ and $\ell_1, \ldots, \ell_m$ are
positive integers. There is a list $w_1, \ldots, w_m$ of binary words, satisfying the prefix
condition and $\mathrm{lgth}(w_i) = \ell_i$, $i = 1, \ldots, m$, if and only if $\sum_{i=1}^{m} 2^{-\ell_i} \le 1$.


5.2.4 McMillan’s Theorem for binary codes Suppose $m$ and $\ell_1, \ldots, \ell_m$ are
positive integers, and $w_1, \ldots, w_m$ are binary words satisfying $\mathrm{lgth}(w_i) = \ell_i$,
$i = 1, \ldots, m$. If $s_i \to w_i$, $i = 1, \ldots, m$, is a uniquely decodable encoding scheme
(see the first part of 5.2.2 for the meaning of this) then $\sum_{i=1}^{m} 2^{-\ell_i} \le 1$.


McMillan’s Theorem has virtually no practical significance at this point. 
We repeat it here just because it is a beautiful theorem. Together with Kraft’s 
Theorem, what it says of pseudo-practical importance is that if your primary 
criteria for goodness of an encoding scheme are unique decodability, first, and 
prescribed code word lengths, second, then there is no need to leave the friendly 
family of prefix codes. 

Kraft’s Theorem brings us to the main result of this section. The proof is 
outlined in Exercise 5.2.2. 


5.2.5 Theorem A list $w_1, \ldots, w_m$ of binary words has the strong parsing
property if and only if the list satisfies the prefix condition and $\sum_{i=1}^{m} 2^{-\mathrm{lgth}(w_i)} = 1$.
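
Theorem 5.2.5 turns the SPP into something that can be checked mechanically. The sketch below does exactly what the theorem says and nothing more; the function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used so that the test "sum equals 1" is not disturbed by rounding.

```python
from fractions import Fraction


def satisfies_prefix_condition(words):
    """True if no word in the list is a prefix of another (repeats count as prefixes)."""
    return not any(
        i != j and v.startswith(u)
        for i, u in enumerate(words)
        for j, v in enumerate(words)
    )


def kraft_sum(words):
    """The sum of 2**(-lgth(w)) over the list, as an exact fraction."""
    return sum(Fraction(1, 2 ** len(w)) for w in words)


def has_spp(words):
    """Criterion of Theorem 5.2.5: prefix condition plus Kraft sum equal to 1."""
    return satisfies_prefix_condition(words) and kraft_sum(words) == 1


print(has_spp(["0", "10", "110", "1110", "1111"]))   # the list (*) of Section 5.1
print(has_spp(["00", "11", "01"]))                   # Exercise 5.1.2(b)
```

The second list satisfies the prefix condition but its Kraft sum is 3/4, so it fails the criterion; adding the word 10 would complete it.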


Exercises 5.2

1. Complete the following to lists with the SPP, adding as few new words to 
the lists as possible. 
(a) 00, 01, 10. 
(b) 00, 10, 110, 011. 
(c) 0, 10, 110, 1110. 


2. Prove Theorem 5.2.5 by completing the following. 


(a) Show that if binary words $w_1, \ldots, w_m$ do not satisfy the prefix condition,
then the list cannot possess the SPP. [If $w_i = w_j v$, $i \ne j$, then $w_i = w_j v$
can be parsed in at least two different ways by $w_1, \ldots, w_m$.]


(b) Prove this part of the assertion in 5.2.2: If $w_1, \ldots, w_m$ satisfy the prefix
condition, and if $w_{i_1} \cdots w_{i_t} = w_{j_1} \cdots w_{j_r}$, then $r = t$ and $i_k = j_k$,
$k = 1, \ldots, t$. [Hint: if not, let $k$ be the smallest index such that $i_k \ne j_k$. Then
$w_{i_k} \cdots w_{i_t} = w_{j_k} \cdots w_{j_r}$; since the two strings agree at every position, they
must agree in the first $\min(\mathrm{lgth}(w_{i_k}), \mathrm{lgth}(w_{j_k}))$ places. But then $w_1, \ldots, w_m$
do not satisfy the prefix condition, contrary to supposition. Why don’t
they?]

(c) Suppose $w_1, \ldots, w_m$ are binary words satisfying the prefix condition,
and $\sum_{i=1}^{m} 2^{-\ell_i} < 1$, where $\ell_i = \mathrm{lgth}(w_i)$, $i = 1, \ldots, m$. Let $\ell \ge \max_i \ell_i$ be
sufficiently large that $2^{-\ell} + \sum_{i=1}^{m} 2^{-\ell_i} \le 1$. By the proof of Kraft’s Theorem
(see Section 4.2), there is a binary word W of length $\ell$ that has none of
$w_1, \ldots, w_m$ as a prefix. Conclude that W cannot be parsed into a string
$w_{i_1} \cdots w_{i_t} v$ for some indices $i_1, \ldots, i_t$ and leave $v$ satisfying
$\mathrm{lgth}(v) < \max_i \mathrm{lgth}(w_i)$ (why not?) and that, therefore, $w_1, \ldots, w_m$ does not have
the SPP.


(d) Suppose $w_1, \ldots, w_m$ are binary words satisfying the prefix condition
and $\sum_{i=1}^{m} 2^{-\ell_i} = 1$, where $\ell_i = \mathrm{lgth}(w_i)$, $i = 1, \ldots, m$. If W is a binary
word of length $\ell \ge \max_i \ell_i$, then W must have one of the $w_i$ as a prefix,
for, if not, then the list $w_1, \ldots, w_m, W$ satisfies the prefix condition, yet
$2^{-\mathrm{lgth}(W)} + \sum_{i=1}^{m} 2^{-\mathrm{lgth}(w_i)} = 2^{-\ell} + 1 > 1$, which is impossible, by Kraft’s
Theorem. Conclude that every such W can be written $W = w_{i_1} \cdots w_{i_t} v$ for
some indices $i_1, \ldots, i_t$ and $v$ satisfying $\mathrm{lgth}(v) < \max_i \ell_i$. [How do you
arrive at this conclusion?]


Put (a)—(d) together for a proof of Theorem 5.2.5. 


3. What are the analogues of the SPP and Theorem 5.2.5 for non-binary code 
alphabets? 


4. Show that any list $w_1, \ldots, w_k$ of binary words satisfying the prefix
condition can be completed to a list $w_1, \ldots, w_m$ with the SPP. [You will probably
need to use a result, mentioned in part (c) of Exercise 2, above, which is 
embedded in the proof of Kraft’s Theorem. Incidentally, the result of this 
exercise holds for non-binary alphabets, as well.] 


5.3 Choosing an encoding scheme 


Suppose that binary words $s_1, \ldots, s_m$ with the SPP have been settled upon, and
the original file has been parsed (at least hypothetically) into what we will
continue to call a source string, $Z = s_{i_1} \cdots s_{i_N}$, together with, possibly, a small leave
at the end. [How does one select the best source alphabet, $S = \{s_1, \ldots, s_m\}$, for
the job? Not much thought has been given to this question! Recall from previous
discussion that the blue-collar solution to the problem of choosing S is to
take $S = \{0, 1\}^L$, usually with L = 8.]

Now consider the problem of deciding on an encoding scheme, $s_i \to w_i$,
$i = 1, \ldots, m$, for the replacement of the $s_i$ by other binary words, the $w_i$, so that
the resulting file U is as much shorter than the original as possible, and so that 
the original file is recoverable from U. It suffices to recover Z from U. (The 
leave, if any, is handled in separate arrangements.) 

Pretty clearly the length of U will be completely determined by the lengths
of the $w_i$. McMillan’s Theorem now enters to assure us that we lose nothing by
requiring that $w_1, \ldots, w_m$ satisfy the prefix condition, which takes care of the
recoverability of Z from U. Plus, the recovery will be as rapid as can reasonably
be expected.

So we are looking for $w_1, \ldots, w_m$, satisfying the prefix condition, such
that $w_{i_1} \cdots w_{i_N}$ is of minimal length. Common sense says, find a prefix code
that assigns to the most frequently occurring $s_i$ the shortest $w_i$, and to the least
frequently occurring $s_i$ the longest $w_i$ (which will be no longer than necessary).
Common sense is not mistaken, in this instance. (See Theorem 4.3.3, after 
reading what follows.) But common sense still leaves us wondering what to do. 

Let us focus on the phrase “most frequently occurring”; let, as in Chapter 4,
$f_j$ stand for the relative frequency of the source letter $s_j$ in the source text. To put
it another way, $f_j$ is the proportion of $s_j$’s in the source string Z. For instance,
if $Z = s_1 s_4 s_4 s_2 s_3 s_1 s_4$, then $f_1 = 2/7$, $f_2 = 1/7 = f_3$, and $f_4 = 3/7$. (And what
about $f_5$, in case $m > 4$? $f_5 = 0$ in this case.) Note again that $\sum_{j=1}^{m} f_j = 1$.
Recall that $f_1, \ldots, f_m$ can be thought of as the probabilities assigned to the
distinct outcomes of a probabilistic experiment; the experiment is choosing a
source letter “at random” from the source string Z, and $f_j$ is the probability that
the letter chosen will be $s_j$.

Recall that if $s_j \to w_j \in \{0, 1\}^{\ell_j}$ is an encoding scheme, the average code
word length of the scheme is $\bar{\ell} = \sum_{j=1}^{m} f_j \ell_j$. The length of the file U obtained
by replacing the $s_j$ by the $w_j$, in Z, will be $\bar{\ell} N$, where N is the length of Z as a
word over $S = \{s_1, \ldots, s_m\}$. (Verify! This assertion arises from the definitions
of the $f_j$ and of $\bar{\ell}$.) If the binary word $s_j$ has length $L_j$, and $\bar{L} = \sum_{j=1}^{m} f_j L_j$,
then the original file W, parsed by the $s_j$ into Z (we neglect the
leave), has length $\bar{L} N$. Thus the compression ratio would be $\bar{L} N / \bar{\ell} N = \bar{L} / \bar{\ell}$.
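
As a check on these definitions, the averages can be computed for the example of Section 5.1. This is a minimal sketch with hypothetical names (`relative_frequencies`, `average_length`); the source string is represented by its list of subscripts, and exact fractions are used throughout.

```python
from collections import Counter
from fractions import Fraction

# The lists (*) and (**) of Section 5.1, keyed by subscript.
SOURCE_WORDS = {1: "0", 2: "10", 3: "110", 4: "1110", 5: "1111"}
CODE_WORDS = {1: "1111", 2: "1110", 3: "110", 4: "10", 5: "0"}


def relative_frequencies(z):
    """Map each subscript j occurring in Z to its relative frequency f_j."""
    counts = Counter(z)
    return {j: Fraction(c, len(z)) for j, c in counts.items()}


def average_length(freqs, words):
    """Sum of f_j * lgth(words[j]) over the letters that occur."""
    return sum(f * len(words[j]) for j, f in freqs.items())


z = [5, 2, 5, 4, 4, 5, 1, 4, 3]                 # Z = s5 s2 s5 s4 s4 s5 s1 s4 s3
freqs = relative_frequencies(z)
l_bar = average_length(freqs, CODE_WORDS)       # average code word length
L_bar = average_length(freqs, SOURCE_WORDS)     # average source word length
ratio = L_bar / l_bar                           # compression ratio
print(l_bar, L_bar, ratio)
```

Multiplying the two averages by N = 9 gives back the file lengths 20 and 30 observed in Section 5.1, so the ratio agrees with the 3/2 found there.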

The problem we face here is the same as that enunciated in Section 4.3:
given $s_1, \ldots, s_m$ and $f_1, \ldots, f_m$, find a prefix-condition scheme $s_j \to w_j \in \{0, 1\}^{\ell_j}$
which minimizes $\bar{\ell} = \sum_{j=1}^{m} f_j \ell_j$. We know how to solve this problem; the
solution is Huffman’s algorithm, discovered in 1951 (published in 1952) by David
Huffman, then a graduate student at M.I.T. He discovered the algorithm in
response to a problem (the very problem to which his algorithm is the solution)
posed in a course on communication theory by the great mathematician, R. M.
Fano. (Professor Fano did not tell the class that the problem was unsolved!)

In Section 4.3 Huffman’s algorithm is described approximately as Huff- 
man described it in his paper of 1952 [36]. Later in this section we will take 
another look at it and give an equivalent description, involving tree diagrams, 
which turns out to be much easier to implement and lends itself to more clever 
improvements than the merge-and-rearrange, merge-and-rearrange procedure 
described in 4.3. 

Before that we will look at two other algorithmic methods for approxi- 
mately solving the problem of finding an optimal prefix-condition encoding 
scheme, given $s_1, \ldots, s_m$ and $f_1, \ldots, f_m$. The first is due to Claude Shannon,
who is to the theory of communication as Euclid is to geometry; the second is 
due to the aforementioned R. M. Fano. Both are given in Shannon’s opus [63], 
published in 1948. They do not always give optimal encoding schemes, but they 
are of historical and academic interest, they leave unanswered questions pursuit 
of which might be fruitful, and Shannon’s method provides a proof of an impor- 
tant theorem that tells you approximately how good a compression ratio can be 
achieved by zeroth order replacement, before you do the hard work of achieving 
it. 


5.3.1 Shannon’s method 


Given $s_1, \ldots, s_m$ and $f_1, \ldots, f_m$, rename, if necessary, so that $f_1 \ge \cdots \ge f_m > 0$.
(Any source letters that do not appear are deleted from the source alphabet.)
Define $F_1 = 0$ and $F_k = \sum_{i=1}^{k-1} f_i$, $2 \le k \le m$. Let $\ell_k$ be the positive integer
satisfying $2^{-\ell_k} \le f_k < 2^{-\ell_k + 1}$. In other words, $\ell_k = \lceil \log_2 f_k^{-1} \rceil$. (Verify!
Incidentally, $\lceil \cdot \rceil$ stands for “round up.”)

In Shannon’s method, the encoding scheme is $s_j \to w_j$, $j = 1, \ldots, m$, with
$w_j$ consisting of the first $\ell_j$ bits in the binary expansion of $F_j$. Two remarks are
necessary before we look at an example.

Dyadic fractions are rational numbers of the form $p/2^m$, with $m$ a nonnegative
integer and $p$ an integer. If $1 \le p < 2^m$, then $p = \sum_{j=0}^{m-1} a_j 2^j$ for some
(binary) digits $a_0, \ldots, a_{m-1} \in \{0, 1\}$; it follows that

$$
p/2^m = \sum_{j=0}^{m-1} a_j 2^{j-m} = \sum_{k=1}^{m} a_{m-k} 2^{-k} = (.a_{m-1} \cdots a_0)_2.
$$

The point is, every dyadic fraction between 0 and 1 has a “finite” binary
representation, meaning a binary expansion with an
infinite string of zeros at the end. And conversely, every number between 0 and
1 with a finite binary representation is a dyadic fraction:

$$
(.b_1 \cdots b_m)_2 = \sum_{j=1}^{m} b_j 2^{-j} = \frac{1}{2^m} \sum_{j=1}^{m} b_j 2^{m-j} = p/2^m,
$$

with $p$ an integer between 0 and $2^m - 1$, inclusive, if $b_1, \ldots, b_m \in \{0, 1\}$.
Because $1 = \sum_{k=1}^{\infty} 2^{-k}$, the non-zero dyadic fractions have the unpleasant
property that they each have two different binary representations, one ending in
an infinite string of ones. For instance,

$$
\frac{3}{4} = \frac{1}{2} + \frac{1}{4} = (.11)_2 = (.1011\cdots)_2 = (.10\overline{1})_2
= \frac{1}{2} + \sum_{k=3}^{\infty} 2^{-k}.
$$

(We will sometimes use the common bar notation, as in $(.10\overline{1})_2$, to indicate
infinite repetition. Thus $(.11\overline{101})_2 = (.11101101\cdots)_2$.) This is a nuisance
because, in the description of Shannon’s method, we have the reference to “the
binary expansion of $F_j$.” What if $F_j$ is a dyadic fraction? The answer is that,
here and elsewhere, we take “the” binary expansion of a dyadic fraction to be
the “finite” one, the one that ends in an infinite string of zeros.

Incidentally, if you have never checked such a thing before, it might be a 
salutary exercise to prove that the non-zero dyadic fractions are the only num- 
bers with two distinct binary expansions. Or, you can take our word for it. 

The second remark concerns the following question: given a number $r \in (0, 1)$,
how do you determine its binary expansion if r is given as a fraction or
as a decimal? Here is a trick we lifted from Neal Koblitz [41]. Start doubling.
Every time the “answer” is less than one, the corresponding binary digit is 0.
Every time the “answer” is $\ge 1$, the next binary digit is 1; then subtract 1 from
the “answer” before doubling again.

For instance, if r = 2/7, the process looks like:


double after adjustment | 4/7 | 8/7 | 2/7 | 4/7
binary digit            |  0  |  1  |  0  |  0


So, $2/7 = (.\overline{010})_2$.
Another instance: suppose r = .314, in decimal form. The process looks
like


double after adjustment | .628 | 1.256 | .512 | 1.024 | .048  | .096
binary digit            |  0   |   1   |  0   |   1   |  0    |  0
double after adjustment | .192 | .384  | .768 | 1.536 | 1.072 | .144
binary digit            |  0   |   0   |  0   |   1   |  1    |  0


and we mercifully stop here, since the “period” of the binary expansion could
be 499 bits long, and we are not really interested in finding out just how long it
is. Anyway, we have the first 12 bits of the binary expansion:

$.314 = (.010100000110\cdots)_2$


© 2003 by CRC Press LLC


We leave it to those interested to ponder why Koblitz’s method works. Think
about the effect of doubling a number between 0 and 1 already in binary form.
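
The doubling trick is short enough to write out. The following is a minimal sketch, with the hypothetical name `binary_digits`; it uses exact rational arithmetic so that the digits are not corrupted by floating-point rounding, and for a dyadic fraction it produces the “finite” expansion, as agreed above.

```python
from fractions import Fraction


def binary_digits(r, n):
    """First n binary digits of r, 0 <= r < 1, by repeated doubling.

    `r` may be a Fraction or a decimal string such as "0.314"."""
    r = Fraction(r)
    digits = []
    for _ in range(n):
        r *= 2
        if r >= 1:            # "answer" at least 1: digit 1, then subtract 1
            digits.append("1")
            r -= 1
        else:                 # "answer" below 1: digit 0
            digits.append("0")
    return "".join(digits)


print(binary_digits(Fraction(2, 7), 6))
print(binary_digits("0.314", 12))
```

The two calls reproduce the worked instances in the text: the repeating block 010 for 2/7, and the first 12 bits of .314.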


Examples of Shannon’s method Let’s first recall the example in Section 5.1.
After parsing, the source string was $s_5 s_2 s_5 s_4 s_4 s_5 s_1 s_4 s_3 = Z$. So $f_1 = f_2 = f_3 = 1/9$,
$f_4 = f_5 = 3/9$. But wait! Shannon’s method applies to source frequencies
in non-increasing order, the reverse of what we have here. So put $\hat{s}_i = s_{5-i+1}$,
$\hat{f}_i = f_{5-i+1}$ and let’s follow Shannon’s method with the $F_i$ in the recipe
renamed $\hat{F}_i$ and the $\ell_i$ renamed $\hat{\ell}_i$.

It is easy to see that $\hat{\ell}_1 = \hat{\ell}_2 = 2$ and $\hat{\ell}_3 = \hat{\ell}_4 = \hat{\ell}_5 = 4$. [It is easy to
calculate the $\ell_k$ in Shannon’s recipe if you bear in mind that $\ell_k$ is the “first”
positive integer exponent such that that power of 1/2 is $\le f_k$. So, for instance,
if $\hat{f}_3 = 1/9$, the first power of 1/2 less than or equal to 1/9 is $1/16 = (1/2)^4$, so
$\hat{\ell}_3 = 4$.] Using our method for calculating binary expansions, we have

$$
\begin{aligned}
\hat{F}_1 &= 0 = (.00\ldots)_2\\
\hat{F}_2 &= 3/9 = (.01\ldots)_2\\
\hat{F}_3 &= 6/9 = (.1010\ldots)_2\\
\hat{F}_4 &= 7/9 = (.1100\ldots)_2\\
\hat{F}_5 &= 8/9 = (.1110\ldots)_2
\end{aligned}
$$

Thus the encoding scheme given by Shannon’s method is

$$
\begin{aligned}
s_5 = \hat{s}_1 &\to 00\\
s_4 = \hat{s}_2 &\to 01\\
s_3 = \hat{s}_3 &\to 1010\\
s_2 = \hat{s}_4 &\to 1100\\
s_1 = \hat{s}_5 &\to 1110
\end{aligned}
$$

This gives $\bar{\ell} = 2(3/9) + 2(3/9) + 4(1/9) + 4(1/9) + 4(1/9) = 8/3$; we also
have $\bar{L} = 1/9 + 2/9 + 3/9 + 4(3/9) + 4(3/9) = 10/3$, for a compression
ratio of $\bar{L}/\bar{\ell} = 5/4$ (which you can verify directly by plugging the code words for
the $s_i$ into Z to obtain a compressed string U with 24 bits, compared to the 30
bits we started with). Since we achieved a compression ratio of 3/2 with the
prefix code (**) in Section 5.1, this example shows that Shannon’s method may
fail to give an optimal encoding scheme. The shrewd will notice that we can
modify our encoding scheme by lopping off the final zero of the code words
for $s_3$, $s_2$, and $s_1$, and still have a prefix code. Even this modification does not
optimize compression; we would have $\bar{\ell} = 7/3$ and $\bar{L}/\bar{\ell} = 10/7 < 3/2$.
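
Shannon’s recipe, as stated at the start of 5.3.1, can be sketched directly. The function name `shannon_code` is hypothetical; the sketch assumes the frequencies are supplied as exact fractions, already positive, in non-increasing order, and summing to 1, and it takes the “finite” expansion for dyadic $F_k$.

```python
from fractions import Fraction


def shannon_code(freqs):
    """Code words of Shannon's method for frequencies f_1 >= ... >= f_m > 0."""
    words = []
    F = Fraction(0)                      # F_1 = 0; F_k = f_1 + ... + f_(k-1)
    for f in freqs:
        length = 1                       # smallest positive l with 2**(-l) <= f
        while Fraction(1, 2 ** length) > f:
            length += 1
        r, bits = F, ""
        for _ in range(length):          # first `length` bits of F, by doubling
            r *= 2
            if r >= 1:
                bits += "1"
                r -= 1
            else:
                bits += "0"
        words.append(bits)
        F += f
    return words


first = [Fraction(3, 9), Fraction(3, 9), Fraction(1, 9), Fraction(1, 9), Fraction(1, 9)]
print(shannon_code(first))
```

Applied to the renamed frequencies 3/9, 3/9, 1/9, 1/9, 1/9, this reproduces the code words 00, 01, 1010, 1100, 1110 found by hand above.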

Another example: $m = 6$ and $f_1 = .25 > f_2 = f_3 = .2 > f_4 = f_5 = .15 > f_6 = .05$.
(Here we have already renamed, or rearranged, the source letters
$s_1, \ldots, s_6$, if necessary, so that their relative frequencies are in non-increasing
order.) We compute $\ell_1 = 2$, $\ell_2 = \ell_3 = \ell_4 = \ell_5 = 3$, $\ell_6 = 5$,

$$
\begin{aligned}
F_1 &= 0 = (.00\ldots)_2 & F_4 &= .65 = (.101\ldots)_2\\
F_2 &= .25 = (.010\ldots)_2 & F_5 &= .80 = (.110\ldots)_2\\
F_3 &= .45 = (.011\ldots)_2 & F_6 &= .95 = (.11110\ldots)_2
\end{aligned}
$$

so the encoding scheme is

$$
\begin{aligned}
s_1 &\to 00 & s_4 &\to 101\\
s_2 &\to 010 & s_5 &\to 110\\
s_3 &\to 011 & s_6 &\to 11110
\end{aligned}
$$

with $\bar{\ell} = 2.85$ (check!). We cannot compute the compression ratio without
knowing $\bar{L}$; still, we will see later that $\bar{\ell} = 2.85$ is far from optimal in this
situation. Note that we could obtain a shorter prefix code by lopping off the last
two bits of $w_6 = 11110$, but the new value of $\bar{\ell}$, 2.75, is still far from optimal.

So, the experimental evidence is that Shannon’s method is not so great. We 
will see that it has certain charms, and is worth knowing about for “academic” 
reasons, in the next section. 


5.3.2 Fano’s method 


Once again, assume that $s_1, \ldots, s_m$ and $f_1 \ge \cdots \ge f_m$ are given. In Fano’s
method, you start by dividing $s_1, \ldots, s_m$ into two “blocks” with consecutive
indices, $s_1, \ldots, s_k$ and $s_{k+1}, \ldots, s_m$, with k chosen so that $\sum_{j=1}^{k} f_j$ and $\sum_{j=k+1}^{m} f_j$
are as close as possible to being equal. The code representatives $w_1, \ldots, w_k$ of
$s_1, \ldots, s_k$ start (on the left) with 0, and $w_{k+1}, \ldots, w_m$ start with 1. At the next
stage, the two blocks are each divided into two smaller blocks, according to the 
same rule, and so on. At each new division, a zero is appended to the code 
representatives of the letters in one of the new blocks, and one is appended to 
the code representatives of the letters in the other. Blocks consisting of a single 
letter cannot be further subdivided. 


Examples of Fano’s method We will apply Fano’s method in the two cases to 
which we applied Shannon’s method. In the first case, we will not fuss about 
the renaming, as we did in 5.3.1. 

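For the first example, consider


Frequency   Letter     Code word
  3/9        s_1   →    00
  -------------------------------- 2
  3/9        s_2   →    01
  -------------------------------- 1
  1/9        s_3   →    10
  -------------------------------- 2
  1/9        s_4   →    110
  -------------------------------- 3
  1/9        s_5   →    111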

The number on the right indicates the order in which the divisions were drawn.
Observe that we do not, for instance, group $s_1$ and $s_3$ together in the first block,
with $s_2, s_4, s_5$ in the other, even though 4/9 and 5/9 are more nearly equal than
2/3 and 1/3. Once the $s_i$ are arranged in order of non-increasing frequency, the
blocks must consist of consecutive $s_i$’s.

Could we have started by putting $s_1$ in a block by itself, and $s_2, \ldots, s_5$ in
the other block? Yes, and, interestingly, the encoding scheme obtained would
have been like (**) in 5.1. The average code word length for each scheme is
$\bar{\ell} = 20/9$.

For the second example, consider 


Frequency Letter Code word 
Ss) Sy —> 00 
20 5 =e Ol 2 
20 3. OS 10 : 
15 = 1S. 110 2 
15 oe ~ 10 
05 => 1h * 


There is a choice at the second division of blocks in Fano’s method applied 
to this source alphabet with these relative frequencies. The encoding scheme 
resulting from the other choice has the same average code word length, $\bar{\ell} = 2.55$,
considerably better than that achieved with Shannon’s method. (In fact, this is
the best $\bar{\ell}$ achievable with a prefix code.)

In some cases when there is a choice in the execution of Fano’s method, 
schemes with different values of $\bar{\ell}$ are obtained. For instance, try Fano’s method
in two different ways on a source alphabet with relative frequencies 4/9, 1/9, 
1/9, 1/9, 1/9, 1/9. 

Therefore, it is not the case that Fano’s method always solves the problem 
of minimizing $\bar{\ell}$ with a prefix code. In Exercise 3 at the end of this section
we offer some examples (for two of which we are indebted to Doug Leonard) 
which show that sometimes no instance of Fano’s method solves that problem, 
and sometimes the one that does, when there is a choice, is not the one arrived 
at by the obvious make-the-upper-block-as-small-as-possible convention. 

Since Huffman’s algorithm, coming up next, solves the problem of minimizing
$\bar{\ell}$ with a prefix code, Fano’s method is apparently only of historical
and academic interest. But sometimes pursuit of academic questions leads to 
practical advances. Here are some academic questions to ponder, for those so 
inclined: 


(i) For which relative source frequencies $f_1 \ge \cdots \ge f_m$ does Fano’s method
minimize $\bar{\ell}$?
(ii) Same question, for Shannon’s method. 
(iii) Does Fano’s method always do at least as well as Shannon’s? 


We leave to the reader the verification that Fano’s method always results in a 
prefix code. 




5.3.3 Huffman’s algorithm 


We repeat, in abbreviated form, the account of the binary case of Huffman’s
algorithm given in 4.2, which is essentially that given by Huffman [36]. Given
$s_1,\dots,s_m$ and $f_1 \ge \cdots \ge f_m$, start by merging $s_{m-1}$ and $s_m$ into a new letter, $\sigma$,
with relative frequency $f_{m-1} + f_m$. Now rearrange the letters $s_1,\dots,s_{m-2},\sigma$
into $s'_1,\dots,s'_{m-1}$ with relative frequencies $f'_1 \ge \cdots \ge f'_{m-1}$ and repeat the
process. Continue until you have merged down to two letters, $t_0$ and $t_1$. Begin
encoding by $t_0 \to 0$, $t_1 \to 1$. Follow through the merging in reverse, coding as
you go; if, for instance, $\sigma$ is encoded as $w$, then $s_{m-1} \to w0$, $s_m \to w1$.

Here is another account of the algorithm which seems to lead to imple- 
mentations which take care of all that rearranging in the merging part of the 
algorithm. In the language of graph theory, we start constructing a tree, called a 
Huffman tree, the leaf nodes of which are the letters $s_1,\dots,s_m$, sporting weights
$f_1,\dots,f_m$, respectively. The nodes $s_m$, $s_{m-1}$ now become siblings and are
attached to a parent node $\sigma$, with weight $f_m + f_{m-1}$. At the next stage, two nodes
with least weight that have not yet been used as siblings are paired as siblings 
and attached to a parent node which gets the sum of their weights as its weight. 
Continue this process until the last two siblings are paired. The final parent 
node, with weight 1, is called the root of the tree. 

From each parent there are two edges going to siblings. Label one of these 
0, the other 1. (Which gets which label does not really matter, but there might be 
practical reasons for establishing conventions governing this label assignment, 
as we shall see later when we consider “adaptive” or “dynamic” coding.) 

The code word representing each $s_j$ is obtained by following the path from
$s_j$ to the root node, writing down each edge label, right to left.
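The tree-building account lends itself to a priority queue. The following Python sketch is one possible rendering, not a prescribed implementation: the name `huffman`, the use of `heapq`, and the rule that the first node popped gets edge label 0 are all our assumptions. Ties between equal weights are broken by insertion order, so other valid Huffman trees may differ in shape, though not in average code word length.

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman(freqs):
    """Return one binary Huffman code word per frequency in freqs."""
    if len(freqs) == 1:
        return ["0"]
    tiebreak = count()   # unique counter so the leaf lists are never compared
    # Heap entries: (weight, tiebreak, indices of the leaves below this node).
    heap = [(f, next(tiebreak), [j]) for j, f in enumerate(freqs)]
    heapq.heapify(heap)
    codes = [""] * len(freqs)
    while len(heap) > 1:
        w0, _, leaves0 = heapq.heappop(heap)   # two nodes of least weight
        w1, _, leaves1 = heapq.heappop(heap)   # become siblings
        # Each merge contributes the edge label nearer the root than all
        # earlier ones, so it is written to the left of what is already there.
        for j in leaves0:
            codes[j] = "0" + codes[j]
        for j in leaves1:
            codes[j] = "1" + codes[j]
        heapq.heappush(heap, (w0 + w1, next(tiebreak), leaves0 + leaves1))
    return codes

freqs = [Fraction(n, 100) for n in (25, 20, 20, 15, 15, 5)]
codes = huffman(freqs)
print(codes, float(sum(f * len(c) for f, c in zip(freqs, codes))))
```

Prepending the label at every merge is the “right to left” writing of edge labels along the path from a leaf to the root.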


Examples of Huffman’s algorithm We will do the two examples already 
treated by the methods of Shannon and Fano. In the first example, a Huffman 
tree can be created as follows: 


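[Figure: Huffman tree with leaves $s_1$ (1/3), $s_2$ (1/3), $s_3$ (1/9), $s_4$ (1/9), $s_5$ (1/9); the two edges below each parent node are labeled 0 and 1.]

Scheme
$s_1 \to 0$
$s_2 \to 10$
$s_3 \to 110$
$s_4 \to 1110$
$s_5 \to 1111$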


and $\bar{\ell} = 20/9$. The careful reader will note that there was a certain choice
involved at the third “merge” in the construction of the tree above. If we make
the other essentially different choice, we get


[Figure: Huffman tree with leaves $s_1,\dots,s_5$.]

Scheme
$s_1 \to 00$
$s_2 \to 01$
$s_3 \to 10$
$s_4 \to 110$
$s_5 \to 111$

and, again, $\bar{\ell} = 20/9$.

For the second example, consider


[Figure: Huffman tree with leaves $s_1,\dots,s_6$.]

Scheme
$s_1 \to 00$
$s_2 \to 10$
$s_3 \to 11$
$s_4 \to 010$
$s_5 \to 0110$
$s_6 \to 0111$

with $\bar{\ell} = 2.55$.

There is one essentially different tree obtainable by making a different
choice at the second merge, which results in a scheme with code word lengths
2, 2, 3, 3, 3, 3; again, we have $\bar{\ell} = 2.55$.


In fact, by the proof outlined in Section 4.3.1, Huffman’s algorithm always
results in a prefix condition scheme which minimizes $\bar{\ell}$, and for every
p.c. scheme minimizing $\bar{\ell}$ (given $f_1,\dots,f_m$), there is an instance of Huffman’s
algorithm that will produce a scheme with the same code word lengths as the
given scheme.


Exercises 5.3 


1. (a) $s_1 = 00$, $s_2 = 01$, $s_3 = 10$, $s_4 = 11$; $f_1 = .4$, $f_2 = .25 = f_3$, $f_4 = .1$. Find
the compression ratio if the $s_j$ are encoded by Huffman’s algorithm.


(b) Same question, with 


s,=00 fl= 
s= Ol ha = 
53=100 fz= 


fa=.2 
s5=110 f5=.15 


fo = -25 


2. Find the encoding schemes and the compression ratios when the methods 
of Shannon and Fano are applied in 1(a) and 1(b), above. 


3. Process the following by the method of Fano, taking every possibility into
account. When are the results optimal? (You could use Huffman’s algorithm
on the same data to determine optimality.) The relative source
frequencies are:

(a) .38, .24, .095, .095, .095, .095
(b) 5/13, 2/13, 2/13, 2/13, 2/13
(c) .4, .15, .1, .09, .09, .09, .08


5.4 The Noiseless Coding Theorem and Shannon’s bound


Here are two good things about Shannon’s method.

5.4.1 Shannon’s method always results in a prefix-condition encoding scheme.


Proof: Suppose $s_1,\dots,s_m$ and $f_1 \ge \cdots \ge f_m > 0$ are given. Since $\ell_k =
\lceil \log_2 f_k^{-1} \rceil$, we have $\ell_1 \le \cdots \le \ell_m$. Therefore, since $\ell_k$ is the length of the
code word $w_k$ to which $s_k$ is assigned in the encoding scheme arrived at by
Shannon’s method, it will suffice to show that for $1 \le k < r \le m$, $w_k$ is not a
prefix of $w_r$.

For such $k$ and $r$,

$$F_r \ge F_{k+1} = F_k + f_k \ge F_k + 2^{-\ell_k}.$$

If the binary expansions of $F_k$ and $F_r$ were to agree in the first $\ell_k$ positions, then
the most they could differ by would be $\sum_{j=\ell_k+1}^{\infty} 2^{-j} = 2^{-\ell_k}$, and they could only
differ by this much if the binary expansion of one of them were all zeroes from
the $(\ell_k + 1)$st place on, and the binary expansion of the other were all ones from
that point on. None of our binary expansions end with an infinite string of ones,
by convention. Therefore, if the binary representations of $F_k$ and $F_r$ agree in
the first $\ell_k$ places, they are less than $2^{-\ell_k}$ apart. Since they are, in fact, at least
$2^{-\ell_k}$ apart, it follows that their binary representations do not agree in the first
$\ell_k$ places, so $w_k$ is not a prefix of $w_r$. □
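For readers who like to see the objects in the proof, here is a small Python sketch of Shannon’s method as the proof uses it: $\ell_k = \lceil \log_2 f_k^{-1} \rceil$, $F_k = \sum_{j<k} f_j$, and $w_k$ is the first $\ell_k$ bits of the binary expansion of $F_k$. The function name `shannon` is ours, and exact fractions are used so that the ceiling and the bit extraction are free of rounding error.

```python
from fractions import Fraction

def shannon(freqs):
    """Code words from Shannon's method.

    freqs: positive frequencies in non-increasing order, summing to 1.
    """
    codes, F = [], Fraction(0)
    for f in freqs:
        # ell = ceil(log2(1/f)), i.e. the smallest ell with 2**-ell <= f.
        ell = 0
        while Fraction(1, 2 ** ell) > f:
            ell += 1
        # First ell bits of the binary expansion of the partial sum F.
        bits, x = "", F
        for _ in range(ell):
            x *= 2
            bit = int(x >= 1)
            bits += str(bit)
            x -= bit
        codes.append(bits)
        F += f
    return codes

freqs = [Fraction(3, 9), Fraction(3, 9), Fraction(1, 9), Fraction(1, 9), Fraction(1, 9)]
print(shannon(freqs))   # ['00', '01', '1010', '1100', '1110']
```

No word in the printed list is a prefix of a later one, as 5.4.1 promises.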


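5.4.2 Given $S = \{s_1,\dots,s_m\}$ and $f_1,\dots,f_m$, the encoding scheme resulting
from Shannon’s method has average code word length $\bar{\ell}$ satisfying $\bar{\ell} < H + 1$,
where $H = H(S) = \sum_{k=1}^{m} f_k \log_2 f_k^{-1}$.


Proof:

$$\bar{\ell} = \sum_{k=1}^{m} f_k \ell_k = \sum_{k=1}^{m} f_k \lceil \log_2 f_k^{-1} \rceil
< \sum_{k=1}^{m} f_k (\log_2 f_k^{-1} + 1) = H + \sum_{k=1}^{m} f_k = H + 1. \qquad \Box$$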


Those who have read certain parts of the first four chapters will recognize
$H(S)$ as the source entropy. More exactly, $H(S)$ is the source entropy when
the “source” is a random emitter of source letters, with $s_j$ being emitted with
relative frequency (probability) $f_j$. We will consider slightly more sophisticated
models of the “source,” that mysterious and, legend has it, imaginary entity that
produces source text, in Chapter 7.

Assertion 5.4.2 is half of the binary version of the Noiseless Coding Theorem
for Memoryless Sources, stated in full, for code alphabets of any size, as
Theorem 4.3.7. We will give the proof for the binary case here and relegate the
proof of the more general theorem to the problem section. We owe our proof of
the second half of the theorem to Dominic Welsh [81] or to Robert Ash [4].


5.4.3 Noiseless Binary Coding Theorem for Memoryless Sources Given $S
= \{s_1,\dots,s_m\}$ and $f_1,\dots,f_m > 0$, with $f_j$ being the relative frequency of $s_j$
in the source text; then the smallest average code word length $\bar{\ell}$ achievable by a
uniquely decodable scheme $s_j \to w_j \in \{0,1\}^+$ satisfies $H(S) \le \bar{\ell} < H(S) + 1$.
Furthermore, $\bar{\ell} = H(S)$ is achievable if and only if each $f_j$ is an integral power
of 1/2.


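Proof: $\bar{\ell} < H(S) + 1$ follows from 5.4.1 and 5.4.2. Now suppose $s_j \to w_j \in
\{0,1\}^+$ is any uniquely decodable scheme, and $\bar{\ell} = \sum_{j=1}^{m} f_j \ell_j$. By McMillan’s
Theorem, $G = \sum_{j=1}^{m} 2^{-\ell_j} \le 1$. Set $q_k = 2^{-\ell_k}/G$, $k = 1,\dots,m$. Note that
$\sum_{k=1}^{m} q_k = 1$.
We will use Lemma 2.1.1, which says that $\ln x \le x - 1$ for all $x > 0$, with
equality only if $x = 1$. It follows that

$$\begin{aligned}
\sum_{j=1}^{m} f_j \log_2 f_j^{-1} - \sum_{j=1}^{m} f_j \log_2 q_j^{-1}
&= \log_2(e) \sum_{j=1}^{m} f_j \ln\frac{q_j}{f_j}
 \le \log_2(e) \sum_{j=1}^{m} f_j \Big(\frac{q_j}{f_j} - 1\Big) \\
&= \log_2(e)\Big(\sum_{j=1}^{m} q_j - \sum_{j=1}^{m} f_j\Big) = \log_2(e)(1 - 1) = 0.
\end{aligned}$$

Thus $H(S) = \sum_{j=1}^{m} f_j \log_2 f_j^{-1} \le \sum_{j=1}^{m} f_j \log_2 q_j^{-1}$, with equality if and only
if $q_j = f_j$, $j = 1,\dots,m$.
We also have

$$\sum_{j=1}^{m} f_j \log_2 q_j^{-1} = \sum_{j=1}^{m} f_j \log_2 (2^{\ell_j} G)
= \sum_{j=1}^{m} f_j \ell_j + (\log_2 G) \sum_{j=1}^{m} f_j \le \sum_{j=1}^{m} f_j \ell_j = \bar{\ell},$$

since $G \le 1$ implies $\log_2 G \le 0$. We have equality in this last inequality if and
only if $G = 1$. Thus $H(S) \le \bar{\ell}$, and equality implies $f_j = q_j = 2^{-\ell_j}/G =
2^{-\ell_j}/1 = 2^{-\ell_j}$, $j = 1,\dots,m$.
This proves everything except that $\bar{\ell} = H$ can be achieved if $f_j = 2^{-\ell_j}$, $j =
1,\dots,m$, for positive integers $\ell_1,\dots,\ell_m$. However, in this case the code word
lengths in the scheme resulting from Shannon’s method are precisely $\ell_1,\dots,\ell_m$,
so the $\bar{\ell}$ for that scheme is

$$\bar{\ell} = \sum_{j=1}^{m} f_j \ell_j = \sum_{j=1}^{m} f_j \log_2 f_j^{-1} = H(S). \qquad \Box$$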


For instance, returning to the example in 5.1, with $f_1 = f_2 = f_3 = 1/9$,
$f_4 = f_5 = 3/9$, and $L = 10/3$, the Noiseless Coding Theorem says that the best
$\bar{\ell}$ we can get with a prefix-condition scheme satisfies

$$2.113 \approx H \le \bar{\ell} < H + 1 \approx 3.113,$$

so the best compression ratio we can hope for by the method of this chapter
with this choice of source alphabet satisfies

$$1.07 \approx \frac{L}{H+1} < \frac{L}{\bar{\ell}} \le \frac{L}{H} \approx 1.577.$$

(In fact, the compression ratio of 1.5 achieved in 5.1 is the best possible,
because the encoding scheme there was generated by Huffman’s algorithm, which
always gives the smallest $\bar{\ell}$ among those arising from prefix-condition schemes.)

The Noiseless Coding Theorem has practical value. If you have chosen
$s_1,\dots,s_m$, binary words with the SPP, and determined, or estimated, $f_1,\dots,f_m$,
their relative frequencies in the source text obtained by parsing the original
file, then you know $L = \sum_j f_j\,\mathrm{lgth}(s_j)$ and $H = \sum_j f_j \log_2 f_j^{-1}$; the Noiseless
Coding Theorem says that you cannot achieve a greater compression ratio, by
replacement of the $s_j$ according to some prefix-condition encoding scheme, than
L/H. If L/H is not big enough for your purposes, then you can stop wasting 
your time, back up, and either try again with a different source alphabet, or try 
some entirely different method of data compression. 
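As a quick numerical check of the figures quoted for the example of 5.1, the following Python sketch computes $H$ and the two bounds on the compression ratio. The helper name `entropy` is ours; the value $L = 10/3$ is taken from the text rather than recomputed, since the source words themselves are listed in 5.1.

```python
from math import log2

def entropy(freqs):
    """H = sum of f * log2(1/f) over the positive frequencies."""
    return sum(f * log2(1 / f) for f in freqs if f > 0)

freqs = [1/9, 1/9, 1/9, 3/9, 3/9]
L = 10 / 3                      # average source word length quoted for 5.1
H = entropy(freqs)

print(round(H, 3))              # 2.113
print(round(L / (H + 1), 2))    # 1.07   (a lower bound on the best ratio)
print(round(L / H, 3))          # 1.577  (the Shannon bound L/H)
```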

In the context of this chapter, with $s_1,\dots,s_m$ and $f_1,\dots,f_m$ given so that
L and H are calculable, L/H is called the Shannon bound on the compression 
ratio. Shannon noted that, under a certain assumption about the source, there is 
a trick that enables you to approach the Shannon bound as closely as desired, 
for long, long source strings. This trick is contained in the proof of the next 
result. 


5.4.4 Theorem Suppose that $S = \{s_1,\dots,s_m\}$ is a set of binary words with the
SPP, and suppose we confine our efforts to files which are parsed by $S$ into
source strings in which the $s_j$ occur randomly and independently with relative
frequencies $f_j$. Then for any $\epsilon > 0$, it is possible to achieve a compression ratio
greater than $(L/H) - \epsilon$, with lossless compression, on sufficiently long files,
where

$$L = \sum_{j=1}^{m} f_j\,\mathrm{lgth}(s_j) \quad\text{and}\quad H = \sum_{j=1}^{m} f_j \log_2 f_j^{-1}.$$

Proof: The trick is, instead of encoding $s_1,\dots,s_m$, we take as source alphabet
$S^N$, the set of all words of length $N$ over $S$, where $N$ is a positive integer that
we will soon take to be large enough for our purposes, depending on $\epsilon$.
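The assumption about the source (which is essentially unverifiable in real
life) implies that the relative frequency of the word $s_{i_1} \cdots s_{i_N} \in S^N$, among all
source words of length $N$ in long source texts, will be the product $f_{i_1} \cdots f_{i_N}$.
Therefore,

$$\begin{aligned}
H(S^N) &= -\sum_{1 \le i_1,\dots,i_N \le m} f_{i_1} \cdots f_{i_N} \log_2 (f_{i_1} \cdots f_{i_N}) \\
&= \sum_{1 \le i_1,\dots,i_N \le m} f_{i_1} \cdots f_{i_N} \sum_{j=1}^{N} \log_2 f_{i_j}^{-1} \\
&= \sum_{i_1=1}^{m} f_{i_1} \log_2 f_{i_1}^{-1} \Big( \sum_{1 \le i_2,\dots,i_N \le m} f_{i_2} \cdots f_{i_N} \Big) + \cdots \\
&\qquad + \sum_{i_N=1}^{m} f_{i_N} \log_2 f_{i_N}^{-1} \Big( \sum_{1 \le i_1,\dots,i_{N-1} \le m} f_{i_1} \cdots f_{i_{N-1}} \Big).
\end{aligned}$$

Now, for instance, $\sum_{1 \le i_2,\dots,i_N \le m} f_{i_2} \cdots f_{i_N} = \big(\sum_{i=1}^{m} f_i\big)^{N-1} = 1$, so

$$H(S^N) = \sum_{i_1=1}^{m} f_{i_1} \log_2 f_{i_1}^{-1} + \cdots + \sum_{i_N=1}^{m} f_{i_N} \log_2 f_{i_N}^{-1}
= N \sum_{i=1}^{m} f_i \log_2 f_i^{-1} = N H(S) = NH.$$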

Also, by the statistical principle that the average of the sum is the sum of
the averages (see Section 1.8), the binary words represented by the source words
$s_{i_1} \cdots s_{i_N} \in S^N$ will have average length $NL$.

By, say, Shannon’s method (see 5.3.1), $S^N$ can be encoded with a prefix-condition
scheme with average code word length $\bar{\ell}(N) < H(S^N) + 1 = NH + 1$.
Thus the compression ratio achieved over original files so long that lengths of
the source strings obtained by parsing them are considerably greater than $N$ will
be

$$\frac{NL}{\bar{\ell}(N)} > \frac{NL}{NH + 1} = \frac{L}{H + \frac{1}{N}} > \frac{L}{H} - \epsilon$$

for $N$ sufficiently large. □
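The two facts the proof rests on, $H(S^N) = NH$ and $\bar{\ell}(N) < NH + 1$ for Shannon’s method, are easy to observe numerically. The Python sketch below does so for the illustrative frequencies .4, .3, .2, .1; the function name `block_stats` is ours, and floating-point arithmetic is assumed to be accurate enough here because no product of these frequencies is an exact power of 1/2.

```python
from itertools import product
from math import ceil, log2, prod

def block_stats(freqs, N):
    """Entropy of S^N and the average Shannon-method code word length for S^N,
    with word frequencies imposed by multiplication."""
    probs = [prod(t) for t in product(freqs, repeat=N)]
    H_N = sum(p * log2(1 / p) for p in probs)
    ell_N = sum(p * ceil(log2(1 / p)) for p in probs)
    return H_N, ell_N

freqs = [0.4, 0.3, 0.2, 0.1]
H, _ = block_stats(freqs, 1)
for N in (1, 2, 3, 4):
    H_N, ell_N = block_stats(freqs, N)
    # Bits per source letter, ell_N / N, lie in [H, H + 1/N).
    print(N, round(H_N / N, 4), round(ell_N / N, 4), round(H + 1 / N, 4))
```

The middle column is the number of code bits spent per source letter; it is squeezed between $H$ and $H + 1/N$, which is exactly why the compression ratio approaches $L/H$.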


The trick contained in the proof, of jazzing up the source alphabet by replacing
it by the set of all words of length $N$ over it, with introduced probabilities
defined by

(freq. of $s_{i_1} \cdots s_{i_N}$) $= f_{i_1} \cdots f_{i_N}$,

is worth remembering. The assumption about the source in Theorem 5.4.4 is
essentially the same as the assumption that these defined probabilities are valid.
When would they not be? Well, for instance, if $S = \{s_1, s_2, s_3, s_4\}$ and the source
string consists of $s_1 s_1 s_3 s_2 s_4$ repeated over and over, then $f_1 = 2/5$, $f_2 = f_3 =
f_4 = 1/5$, but the probability of, say, with $N = 4$, $s_2 s_1 s_2 s_3$, is zero, not $f_2 f_1 f_2 f_3 = 2/625$.

However, in practice, even if there is some orderliness in the source which
violates the assumption in Theorem 5.4.4, it often happens that imposing
probabilities on $S^N$ by multiplication is not a bad approximation to reality, and
applying, say, Huffman’s algorithm to $S^N$ with those assumed probabilities
results in improved compression.

A last note on the Shannon bound: does it have anything to tell us about 
how to choose a source alphabet? The choice of a source alphabet affects both L 
and $H$. In principle, we want $L$ to be large and $H$ to be small. Roughly speaking,
$L = \sum_j f_j\,\mathrm{lgth}(s_j)$ will be large when the longer $s_j$ have larger relative
frequency $f_j$; and from the basics about entropy, in Chapter 2, $H$ will be small
when $f_j$ is negligibly small except for a very few $j$. [$H$ is zero when one $f_j$
is 1 and the rest are zero.] So Theorems 5.4.4 and 5.4.3 verify common sense:
we can achieve large, handsome compression ratios if we can choose $s_1,\dots,s_m$
such that the longer binary words $s_j$ occur with great frequency, relatively. This
does not tell us how to find $s_1,\dots,s_m$; it just gives some targets to shoot for.


Exercises 5.4 


1. Compute the Shannon bound L/H on the compression ratio in both parts 
of Exercise 5.3.1. 


2. Let $S = \{s_1, s_2, s_3, s_4\}$ with

$s_1 = 111$    $f_1 = .4$
$s_2 = 110$    $f_2 = .3$
$s_3 = 10$     $f_3 = .2$
$s_4 = 0$      $f_4 = .1$

(a) Find the compression ratio if the $s_j$ are encoded using Huffman’s
algorithm.

(b) Find the compression ratio if $S^2 = \{s_i s_j;\ 1 \le i, j \le 4\}$ is encoded using
Huffman’s algorithm, assuming that the relative frequency of $s_i s_j$ is
$f_i f_j$.

(c) Find the Shannon bounds on the compression ratios in (a) and (b).
[Hint: if they are not the same, then something is wrong!]


3. Suppose that $S = \{0,1\}^L$ for some positive integer $L$ and all source
characters are equally likely. Compute the Shannon bound on the compression
ratio in this case, and the compression ratios actually achieved by the
methods of Shannon, Fano, and Huffman.


4. Suppose $S = \{s_1,\dots,s_m\}$ is a source alphabet with relative source
frequencies $f_j = (1/2)^{\ell_j}$, where $\ell_1,\dots,\ell_m$ are positive integers. Show that
Shannon’s method results in an encoding scheme with average code word length
$\bar{\ell} = H(S)$. [Hint: this demonstration appears somewhere in Section 5.4.]
Do Fano’s and Huffman’s methods do as well in such a case?

5. Prove the Noiseless Coding Theorem for code alphabets $A = \{a_1,\dots,a_n\}$,
$n \ge 2$. The statement is just the same as for the binary case, except that
$\log_2$ is replaced by $\log_n$. In the version of Shannon’s method used in one
part of the proof, binary expansions are replaced by $n$-ary expansions. The
version of McMillan’s theorem to be used in the other part of the proof is
the general one, to be found in 4.2.

Chapter 6


Arithmetic Coding 


As in the preceding chapter, we have a source alphabet $S = \{s_1,\dots,s_m\}$ and
relative source frequencies $f_1,\dots,f_m$, presumably estimated by a statistical study
of the source text. However, in arithmetic coding it is not the case that
individual source letters, or even blocks of source letters, are replaced by binary
code words (although replacing blocks of source letters by binary words
derived arithmetically is an option; see Section 6.3). Rather, the entire source text,
$s_{i_1} \cdots s_{i_N}$, is assigned a codeword arrived at by a rather complicated process, to
be described below.

Methods of arithmetic coding vary, but they all have certain things in common.
Each source word $s_{i_1} \cdots s_{i_N}$ is assigned a subinterval $A(i_1,\dots,i_N)$ of the
unit interval $[0, 1)$. This assignment takes place in such a way that $A(1),\dots,
A(m)$ are disjoint subintervals of $[0, 1)$, and for $N > 1$, the $m$ intervals
$A(i_1,\dots,i_{N-1},1),\dots,A(i_1,\dots,i_{N-1},m)$ are disjoint subintervals of $A(i_1,\dots,i_{N-1})$; the
lengths of these subintervals are to be proportional, or roughly proportional, to
$f_1,\dots,f_m$.

Having determined (in principle) the interval $A(i_1,\dots,i_N) = A$, the
arithmetic encoder chooses a number $r = r(i_1,\dots,i_N) \in A$ and represents the source
word $s_{i_1} \cdots s_{i_N}$ (which is usually the entire source text) by some finite segment
of the binary expansion of $r$. Arithmetic coding methods differ in how $r$ is
arrived at, and in how much of the binary expansion of $r$ is taken to encode the
source word. Enough of the binary expansion of $r$ will have to be taken so that
the decoder will be able to figure out (in principle) in which of the intervals
$A(i_1,\dots,i_N)$ the number $r$ lies; from this the decoder can recover the source
word $s_{i_1} \cdots s_{i_N}$. Usually, the smaller the interval $A(i_1,\dots,i_N)$ is, the farther out
you will have to go in the binary expansion of any number in it to let the decoder
know which interval, and thus which source word, is signified by the code. For
this reason, the larger the intervals $A(i_1,\dots,i_N)$ are, the better the compression,
because the code representative of the source text will be shorter, on average.

Therefore, in “pure” arithmetic coding, the intervals $A(i_1,\dots,i_N)$, $1 \le i_1,
\dots,i_N \le m$, are not only disjoint, they partition $[0, 1)$. Also, you can see the
justification for making the lengths of $A(i_1,\dots,i_k,j)$, $1 \le j \le m$, proportional
to or, at least, increasing functions of, the relative source frequencies $f_j$, for
each $k$. This policy will result in the more likely source texts $s_{i_1} \cdots s_{i_N}$, in which
letters of higher relative source frequency predominate, being assigned longer
intervals $A(i_1,\dots,i_N)$, and will therefore achieve better compression, on the
average, than any perverse policy which inverts the order of the lengths of the
$A(i_1,\dots,i_k,j)$ relative to the $f_j$. More about this later.

The first variety of arithmetic coding that we will consider is probably the 
best zeroth-order non-adaptive lossless compression method that will ever exist, 
if such methods are to be judged by the compression ratios achieved over a wide 
range of source texts. The particular version that we will present in Section 6.1 
is not, in fact, in use in the real world. We think that it conveys the main idea 
of arithmetic coding better than the real world implementations, in which the 
main idea is somewhat obscured by practical tinkering. Further, since this is a 
textbook and not a how-to manual, it is appropriate that general paradigms be 
presented whenever possible. The method of Section 6.1 can be modified in a 
number of ways to be more practicable, but it might be difficult to go from one 
of these offspring to another, without understanding their parent. 

We will look at the compression ratio achievable by the method of Section 
6.1 in Section 6.2, and then consider some of the drawbacks of the method, 
and possible modifications to overcome those drawbacks, in 6.3. In Section 
6.4 we will present a full-fledged practical implementation of arithmetic coding 
which overcomes every problem with the “pure” method of Section 6.1, at the 
cost of a certain amount of fudging and approximation that may diminish the 
compressive power of arithmetic coding, but which seems to compress as well 
as or better than Huffman encoding, in practice. 




6.1 Pure zeroth-order arithmetic coding: dfwld


The ‘dfwld’ in the title of this section is the acronym for dyadic fraction with
least denominator. The plan will be to select the dyadic fraction $r = p/2^L$, $p$ an
odd integer (or $p = 0$ when $r = 0$), with $L$ as small as possible, as the
representative of the interval $A = A(i_1,\dots,i_N)$. To see why we do this, observe that if
the decoder is supplied the source word length $N$ and a number in $A$, then the
decoder can recover the sequence $i_1,\dots,i_N$, and thus the source word $s_{i_1} \cdots s_{i_N}$.
(The decoder knows how the intervals $A(i_1,\dots,i_N)$, $1 \le i_1,\dots,i_N \le m$, are
calculated, so the decoder could, in principle, calculate them all and then pick the
one containing the given number. We shall see a more efficient method of
calculating $i_1,\dots,i_N$ from $r$ and $N$ later in this section.) If $b_1 \cdots b_t$ is a binary word
and $\rho = (.b_1 \cdots b_t)_2 \in A$, then $\rho$ is a dyadic fraction in $A$, with denominator
$2^q$, where $L \le q \le t$; $L \le q$ because $r$ is the dfwld in $A$. Therefore, the binary
expansion of $r$, $r = (.a_1 \cdots a_{L-1}1)_2$, supplies the shortest possible code word,
namely $a_1 \cdots a_{L-1}1$, from which the decoder can recover the source word.

Thus, choosing the dfwld to represent the interval $A = A(i_1,\dots,i_N)$ is
always a good idea in arithmetic coding, no matter how the intervals are
generated.

About notation: when $w = s_{i_1} \cdots s_{i_N}$, a source word, it will sometimes be
convenient to denote $A(i_1,\dots,i_N)$ alternatively as $A(s_{i_1} \cdots s_{i_N})$ or $A(w)$.


Subdividing to find A(w) 
Suppose that $S = \{s_1,\dots,s_m\}$ and the relative source frequencies are $f_1 \ge \cdots \ge
f_m > 0$. Our intervals will always be closed on the left, open on the right,
starting with $[0,1)$. If $[\alpha, \beta) = [\alpha, \alpha + \ell)$, $\ell = \beta - \alpha$, is our “current
interval,” either $[0, 1)$ or $A(s_{i_1} \cdots s_{i_{k-1}})$ for some $k \ge 2$ and $i_1,\dots,i_{k-1} \in \{1,\dots,m\}$,
we want to subdivide $[\alpha, \beta)$ into half-open intervals with lengths proportional
to $f_1,\dots,f_m$. Here are the endpoints of the sought-for subintervals of $[\alpha, \beta)$:
$\alpha$, $\alpha + f_1\ell$, $\alpha + (f_1 + f_2)\ell$, $\dots$, $\alpha + (\sum_{i=1}^{m-1} f_i)\ell$, $\beta$. That is, the $j$th
subinterval, $1 \le j \le m$, is $[\alpha + (\sum_{i<j} f_i)\ell,\ \alpha + (\sum_{i \le j} f_i)\ell)$, where, when $j = 1$, the
sum over the empty set of indices is interpreted as zero. It is recommended
that the reader verify that the length of the $j$th subinterval is $f_j\ell$ and that the
right-hand endpoint of the $m$th subinterval is $\beta$. [Recall that $\ell = \beta - \alpha$ and that
$\sum_{i=1}^{m} f_i = 1$.]

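The process of subdividing these intervals is illustrated below, with $S =
\{a,b,c,d\}$, $f_a = .4$, $f_b = .3$, $f_c = .2$, and $f_d = .1$.

[Figure: three nested subdivisions, with the following endpoints.]

$A(a) = [0, .4)$,  $A(b) = [.4, .7)$,  $A(c) = [.7, .9)$,  $A(d) = [.9, 1)$

$A(ba) = [.4, .52)$,  $A(bb) = [.52, .61)$,  $A(bc) = [.61, .67)$,  $A(bd) = [.67, .7)$

$A(baa) = [.4, .448)$,  $A(bab) = [.448, .484)$,  $A(bac) = [.484, .508)$,  $A(bad) = [.508, .52)$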

Meanwhile, $A(a)$ and $A(d)$ are subdivided as follows:

$A(aa) = [0, .16)$,  $A(ab) = [.16, .28)$,  $A(ac) = [.28, .36)$,  $A(ad) = [.36, .4)$

$A(da) = [.9, .94)$,  $A(db) = [.94, .97)$,  $A(dc) = [.97, .99)$,  $A(dd) = [.99, 1)$

$A(i_1,\dots,i_N)$ is specified by its left-hand endpoint $\alpha$ and its length $\ell$, and
these are straightforward to compute iteratively. If the computation were
organized into a table with columns for “next (source) letter”, “left-hand endpoint
$\alpha$”, and “length $\ell$”, the table would look like this:

next letter     left-hand endpoint $\alpha$          length $\ell$
                0                                    1
$s_{i_1}$       $\sum_{j<i_1} f_j$                   $f_{i_1}$
$\vdots$        $\alpha$                             $\ell$
$s_i$           $\alpha + \ell \sum_{j<i} f_j$       $\ell f_i$

© 2003 by CRC Press LLC


Examples 
Suppose that $S = \{a,b,c,d\}$, $f_a = .4$, $f_b = .3$, $f_c = .2$, and $f_d = .1$. (As usual,
in the absence of subscripts on the source letters, we use the letters themselves
to subscript the relative source frequencies.) We will calculate the intervals
assigned to the source words bacb and ccda.

For bacb, the table is

next letter     left-hand endpoint $\alpha$          length $\ell$
                0                                    1
b               .4                                   .3
a               .4                                   .12
c               .4 + (.12)(.7) = .484                .024
b               .484 + (.024)(.4) = .4936            .0072

Thus $A(bacb) = [.4936, .4936 + .0072) = [.4936, .5008)$. Notice that it is very
easy to find the dfwld in $A(bacb)$, since clearly .5 = 1/2 is in this interval,
and 1/2 is the dfwld in all of $(0, 1)$; in $[0, 1)$ only 0 = 0/1 beats 1/2 for least
denominator, among the dyadic fractions. Thus the source word bacb would be
encoded by 1, a single bit, by the method of arithmetic coding of this section.

For the source word ccda, the table is

next letter     left-hand endpoint $\alpha$          length $\ell$
                0                                    1
c               .7                                   .2
c               .7 + (.2)(.7) = .84                  .04
d               .84 + (.04)(.9) = .876               .004
a               .876                                 .0016

Thus $A(ccda) = [.876, .8776)$. This time we are unlucky, and the dfwld in this
interval is not immediately apparent.
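The table computations are mechanical, and a short Python sketch may help in checking them. The names `FREQS` and `interval` are ours; exact fractions are used so that the endpoints come out as the decimals in the tables, and the letters are assumed to be subdivided in the order in which they are listed in the dictionary.

```python
from fractions import Fraction

FREQS = {"a": Fraction("0.4"), "b": Fraction("0.3"),
         "c": Fraction("0.2"), "d": Fraction("0.1")}

def interval(word, freqs=FREQS):
    """Return (alpha, ell) such that A(word) = [alpha, alpha + ell)."""
    letters = list(freqs)                      # subdivision order
    alpha, ell = Fraction(0), Fraction(1)      # start with [0, 1)
    for s in word:
        # Total frequency of the letters listed before s.
        below = sum((freqs[t] for t in letters[:letters.index(s)]), Fraction(0))
        alpha += below * ell                   # new left-hand endpoint
        ell *= freqs[s]                        # new length
    return alpha, ell

for w in ("bacb", "ccda"):
    alpha, ell = interval(w)
    print(w, float(alpha), float(alpha + ell))
```

Each pass through the loop corresponds to one row of the table.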




Finding the dfwld in a subinterval


Suppose that the interval is $[\alpha, \beta)$. We consider two methods to find the best
representative for the interval.

Method 1 Find the smallest integer $t$ such that $\frac{1}{2^t} \le \beta - \alpha$; i.e., find the
integer $t$ satisfying $2^{-t} \le \ell < 2^{-t+1}$. Then solve the inequalities $\alpha \le \frac{x}{2^t} < \beta$ for
integers $x$. There will be at least one, and at most two, integers $x$ satisfying this
inequality. In case there are two, they will be consecutive. Take the even one.
In any case, $r = \frac{x}{2^t}$ (reduce to lowest terms!) is the dfwld in the interval.

In the first example above, suppose we did not notice that the dfwld is 1/2.
Applying this method, we would find $t = 8$ and set about solving

$$.4936 \le \frac{x}{256} < .5008 \quad\text{or}\quad 126.3616 \le x < 128.2048.$$

There are two whole numbers $x$ satisfying these inequalities, 127 and 128. We
take $x = 128$ and get $r = \frac{128}{256} = 1/2$.

In the second example, by this method, we find $t = 10$ and set about solving
$.876 \le \frac{x}{1024} < .8776$; $897.024 \le x < 898.6624$. This time there is only one $x$,
namely $x = 898$. We find

$$r = \frac{898}{1024} = \frac{449}{512} = \frac{256 + 128 + 64 + 1}{512} = (.111000001)_2.$$

Thus the code for ccda is 111000001.
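Method 1 translates almost line for line into code. The Python sketch below is one way to do it, with exact fractions; the names `dfwld` and `code_word` are ours, and the convention that $r = 0$ yields the empty code word is an assumption made for the sake of a complete function.

```python
from fractions import Fraction
from math import ceil

def dfwld(alpha, beta):
    """Dyadic fraction with least denominator in [alpha, beta), by Method 1."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    t = 0
    while Fraction(1, 2 ** t) > beta - alpha:   # smallest t with 2**-t <= beta - alpha
        t += 1
    x = ceil(alpha * 2 ** t)                    # least integer x with alpha <= x / 2**t
    if x % 2 == 1 and x + 1 < beta * 2 ** t:    # two candidates: take the even one
        x += 1
    return Fraction(x, 2 ** t)                  # Fraction reduces to lowest terms

def code_word(r):
    """Bits of r = p / 2**L written with exactly L binary places."""
    if r == 0:
        return ""                               # assumed convention for r = 0
    L = r.denominator.bit_length() - 1
    return format(r.numerator, "b").zfill(L)

r1 = dfwld(Fraction("0.4936"), Fraction("0.5008"))
r2 = dfwld(Fraction("0.876"), Fraction("0.8776"))
print(r1, code_word(r1))   # 1/2 1
print(r2, code_word(r2))   # 449/512 111000001
```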


Method 2 Carry out the binary expansions of $\alpha$ and $\beta$ until they differ. At the
first place they differ, there will be a 0 in the expansion of $\alpha$, and a 1 in the
expansion of $\beta$; i.e., $\alpha = (.a_1 \cdots a_{t-1} 0 a_{t+1} \cdots)_2$ and $\beta = (.a_1 \cdots a_{t-1} 1 b_{t+1} \cdots)_2$.
In most cases, the dfwld in $[\alpha, \beta)$ will then be $r = (.a_1 \cdots a_{t-1}1)_2$. There are
two annoying exceptions to this rule.$^1$

Exception 1. If $\alpha = (.a_1 \cdots a_{t-1})_2$, then $\alpha$ itself is the dfwld in $[\alpha, \beta)$.

Exception 2. If $\alpha > (.a_1 \cdots a_{t-1})_2$ (i.e., $a_i = 1$ for some $i > t$) and $\beta =
(.a_1 \cdots a_{t-1}1)_2$ (i.e., $b_i = 0$ for all $i > t$), then $(.a_1 \cdots a_{t-1}1)_2 = \beta$ is not actually
in $[\alpha, \beta)$, and so cannot be the dfwld in that interval. In this case the dfwld $r$ in
$[\alpha, \beta)$ is found by continuing the binary expansion of $\alpha$ until a 0 is found among
$a_{t+1}, a_{t+2}, \dots$. If all $a_i$ beyond that point are zero, then, again, $\alpha$ itself is the
dfwld in $[\alpha, \beta)$. Otherwise, change that zero to a one and truncate the binary
expression at that point to obtain the dfwld. Examples: if $\alpha = (.101010111)_2$ and
$\beta = (.101011)_2$, then $r = \alpha$; if $\alpha = (.10101011001\cdots)_2$ and $\beta = (.101011)_2$
then $r = (.101010111)_2$.

In spite of these exceptions, Method 2 is the more “machinable” of the two
methods. Note that, with $\alpha$ and $\beta$ as above, regardless of everything else, the
binary word $a_1 \cdots a_{t-1}$ will be part of the code stream. The next subsection is
about how to take advantage of this fact.

$^1$Doug Leonard points out that if $\alpha = (.a_1 \cdots a_{t-1}0\cdots)_2$ and $\beta = (.a_1 \cdots a_{t-1}1\cdots)_2$ then $r =
(.a_1 \cdots a_{t-1}1)_2$ is the dfwld in $(\alpha, \beta]$, in all circumstances. Therefore, we could avoid those pesky
exceptions in Method 2 by changing our way of subdividing intervals so that the intervals wind up
closed on the right, open on the left, for the most part. In fact, there is a reasonable way to do this so
that for all $N$, $A(w)$ is of the form $(\alpha, \beta]$ for every $w \in S^N$ with two exceptions: $A(s_1^N) = [0, f_1^N]$
and $A(s_m^N) = (1 - f_m^N, 1)$. However, in real live arithmetic coding, to be described in Section 6.4,
it is conventional to take intervals closed on the left, and finding the exact dfwld at the end of the
process is not insisted upon. We decided that it would overly complicate the transition from the
finicky academic version of arithmetic coding of this section to the implementation version in 6.4
to have our intervals here closed on the right, and our intervals there closed on the left.

Notice that the annoying exceptional cases occur only when at least one of
$\alpha$, $\beta$ is a dyadic fraction. Therefore, when neither is, we can forget about those
exceptions and take $r = (.a_1 \cdots a_{t-1}1)_2$ as the dfwld. For instance, in the first
example above (the source word bacb), we had $\alpha = .4936 = (.0\cdots)_2$ and $\beta = .5008 = (.1\cdots)_2$
and neither $\alpha$ nor $\beta$ is a dyadic fraction, clearly, so $r = (.1)_2 = 1/2$. Similarly,
in the second example, leaving out the details of finding the binary expansions,
$\alpha = .876 = (.111000000\cdots)_2$, $\beta = .8776 = (.111000001\cdots)_2$, neither dyadic
fractions, so $r = (.111000001)_2$.




6.1.1 Rescaling while encoding 


The encoding method described in the preceding subsections is, in abbreviation:
find the interval $A(i_1,\dots,i_N) = A$, and then the dfwld $r$ in $A$. The procedure
thus described is wasteful in that initial segments of the final code, the binary
expansion of $r$, are stored twice in the endpoints $\alpha_k$ and $\alpha_k + \ell_k$ of the
intermediate intervals $A(i_1,\dots,i_k)$. Furthermore, there is a waste of time involved:
you have to wait until the entire source string is scanned and processed before 
you have the code for it. Recall that in compression by replacement schemes, 
you can encode as you go along, without waiting to see what is up ahead in the 
source string. There are some situations in which this is a great advantage—in 
digital communications, for instance, when speed is required and typically de- 
coding of the code string starts before encoding of the source string is finished. 

But now Method 2 of calculating dfwlds suggests a way of beginning the 
arithmetic encoding of a source string without waiting for the end of the string, 
while lessening the burden of computation of the endpoints of the intervals 
$A(i_1,\dots,i_k)$, as those endpoints get closer and closer together.

If $\alpha = (.a_1 \cdots a_{t-1}0\cdots)_2$ and $\alpha + \ell = (.a_1 \cdots a_{t-1}1\cdots)_2$, then the first $t - 1$
digits in the binary expansion of any number in an interval with endpoints $\alpha$ and
$\alpha + \ell$ will be $a_1 \cdots a_{t-1}$. Consequently, if $r = r(i_1,\dots,i_N)$ is in that interval
somewhere, then we know the first $t - 1$ bits of the code for $s_{i_1} \cdots s_{i_N}$. We
extract those $t - 1$ bits and multiply $\alpha$ and $\alpha + \ell$ by $2^{t-1}$ mod 1 to obtain the
endpoints of the new current interval. Multiplying $\alpha$ and $\alpha + \ell$ by $2^{t-1}$ mod 1
means that we subtract the integer $(a_1 \cdots a_{t-1})_2$ from $2^{t-1}\alpha$ and $2^{t-1}(\alpha + \ell)$.
Note that this amounts to shifting $a_1 \cdots a_{t-1}$ in each of $\alpha$ and $\alpha + \ell$ to the left of
the binary point and out of the picture. By storing $a_1 \cdots a_{t-1}$ in the code word
being constructed (tack it on to the right of whatever part of the code word had
been found previously) we are really keeping track of $\alpha$ and $\alpha + \ell$, but we do
not need to leave $a_1 \cdots a_{t-1}$ in the binary expressions of these numbers; these
extra bits complicate the calculations needlessly.

This process of finding $a_1 \cdots a_{t-1}$ and then replacing the interval by a new
interval obtained by multiplying the old endpoints by $2^{t-1}$ mod 1 is called
rescaling. Rescaling does not affect the final code word because we go from
$A(i_1,\ldots,i_{k-1})$ to the intervals $A(i_1,\ldots,i_k)$, $1 \le i_k \le m$, by dividing the large
interval into subintervals with lengths proportional to $f_1,\ldots,f_m$, and because
we rescale by multiplying by a power of 2, which preserves binary expressions
beyond the part that gets shifted away. Let’s try it with $S = \{a,b,c,d\}$, $f_a = .4$,
$f_b = .3$, $f_c = .2$, $f_d = .1$, and $w = ccda$.


Next letter or rescale                                $\alpha$                   $\ell$                 code so far

(start)                                               0                          1
c                                                     .7                         .2
Rescale [.7 = (.10...)_2 and .9 = (.11...)_2]         .4 = 2(.7) - 1             .4 = 2(.2)             1
c                                                     .68 = .4 + (.4)(.7)        .08 = (.2)(.4)         1
Rescale [.68 = (.10...)_2 and .76 = (.11...)_2]       .36 = 2(.68) - 1           .16 = 2(.08)           11
d                                                     .504 = .36 + (.9)(.16)     .016 = (.16)(.1)       11
Rescale [.504 = (.100000...)_2 and .520 = (.100001...)_2]
                                                      .128 = 2^5(.504) - 16      .512 = 32(.016)        1110000
a                                                     .128                       .2048 = (.512)(.4)     1110000
Rescale [.128 = (.00...)_2 and .3328 = (.01...)_2]    .256 = 2(.128)             .4096 = 2(.2048)       11100000

(To rescale, find the binary expansions of the two endpoints and shift out the
part where they agree.)


Now find the dfwld in [.256, .6656); it is $1/2 = (.1)_2$. Tack this on to the last
“code so far” to obtain 111000001 as the code word for ccda.

This process superficially may seem more complicated than what we went 
through before to encode ccda because of the added rescaling steps, but the 
lines on the first table for ccda that we generated were filled in at the cost of 
increasingly onerous arithmetic. Rescaling lifts the burden of that arithmetic 
somewhat and, as an important bonus, gives us initial segments of the final 
code word early on. (There is one extreme case in which this is not true; i.e., 
in one case the partial code words supplied by rescaling are not prefixes of the 
final code word. See Exercise 6.1.5.) 

However, as the preceding example shows, rescaling does not supply those
initial segments at a regular pace, as the encoder reads through the source word.
Furthermore, the example of bacb shows that you may not get any initial
segments of the code word at all; in that example, the binary expansions of the
endpoints $\alpha$ and $\alpha + \ell$ always differ in the very first position, so there is no
rescaling and no partial construction of the code word until the very end of the
calculation. A little thought shows that this sort of unpleasantness (no rescaling
for a long time) occurs when the dfwld in $A(s_{i_1} \cdots s_{i_k})$ is the same for
many values of $k$. It is slightly ironic that those occasions when the compression
with dfwld arithmetic coding is great (when the dfwld in $A(s_{i_1} \cdots s_{i_N})$
has a small denominator) are occasions when there is certain to be a long
run of no rescaling in the computation of the intervals and the code word,
because $r = r(i_1,\ldots,i_N)$, the dfwld in $A(s_{i_1} \cdots s_{i_N})$, will also be the dfwld in
$A(s_{i_1} \cdots s_{i_k})$ for many values of $k < N$.

There are two inconveniences to be dealt with when there is a long run of
no rescaling. The first is that we have to carry on doing exact computations
of the endpoints $\alpha$ and $\alpha + \ell$ with smaller and smaller values of $\ell$. This is a
major drawback, the Achilles heel of arithmetic coding, and we will consider
some ways to overcome this difficulty in later sections of this chapter. The other
inconvenience is that initial segments of the code representative of the source
text are not available beyond a certain point.

We will deal with the first of these problems by a device called the under- 
flow expansion, which we found in the work of Cleary, Neal, and Witten [84]. It 
is essential for their arithmetic coding and decoding algorithm, to be presented 
in Section 6.4. Before describing the underflow expansion, however, we will 
make the rescaling operation more practical. 


Rescaling one bit at a time 


In the account of rescaling given above, the binary expansions of the endpoints
$\alpha$ and $\beta$ of the current interval are worked out until they disagree. This
computation is wasteful and unnecessary.

If the binary expansions of $\alpha$ and $\beta$ agree at all in the first few bits, then
they agree in the first bit. This bit will be 1 if and only if $1/2 \le \alpha < \beta$, and will
be 0 if and only if $\alpha < \beta < 1/2$. We may as well enlarge this second case to
$\alpha < \beta \le 1/2$, since if the current interval is $[\alpha, 1/2)$ then the eventually-to-be-
discovered dfwld $r$ in the eventually-to-be-discovered final interval is $< 1/2$,
so the first bit in its binary expansion (in other words, the next bit in the code
stream) will be 0.

If $1/2 \le \alpha < \beta \le 1$, shifting out the initial bit, 1, in the binary expansions
of $\alpha$ and $\beta$ and adding it to the code stream results in $[2\alpha - 1, 2\beta - 1)$ as the
new current interval (and the new dfwld that we are seeking is $2r - 1$, if $r$ was
the old dfwld, somewhere in $[\alpha, \beta)$). If $0 \le \alpha < \beta \le 1/2$, shifting out the bit 0
into the code stream results in $[2\alpha, 2\beta)$ as the new current interval. Thus the
rules for rescaling one bit at a time are: if $0 \le \alpha < \beta \le 1/2$, replace $[\alpha, \beta)$ by
$[2\alpha, 2\beta)$ and add 0 to the code stream; if $1/2 \le \alpha < \beta \le 1$, replace $[\alpha, \beta)$ by
$[2\alpha - 1, 2\beta - 1)$ and add 1 to the code stream.

Notice that the length $\ell = \beta - \alpha$ is multiplied by 2 in each case. Notice
also that rescaling will not be possible when and only when $\alpha < 1/2 < \beta$, i.e.,
when 1/2 is in the interior of the current interval.

Although rescaling one bit at a time superficially seems to increase the
number of operations in dfwld encoding, it in fact provides a “machinable”
and efficient way of carrying out the computation of the new code stored in
the endpoints of the current interval. Here is the encoding of ccda, with source
letters and relative frequencies as before, with rescaling one bit at a time. We use
“$x \to 2x$” and “$x \to 2x - 1$” to indicate which of the rescaling transformations
is being applied.

Next letter or rescale     $\alpha$                    $\ell$                  New code

(start)                    0                           1
c                          .7                          .2
x -> 2x - 1                .4                          .4                      1
c                          .68 = .4 + (.4)(.7)         .08 = (.4)(.2)
x -> 2x - 1                .36                         .16                     1
d                          .504 = .36 + (.9)(.16)      .016 = (.16)(.1)
x -> 2x - 1                .008                        .032                    1
x -> 2x                    .016                        .064                    0
x -> 2x                    .032                        .128                    0
x -> 2x                    .064                        .256                    0
x -> 2x                    .128                        .512                    0
a                          .128                        .2048 = (.512)(.4)
x -> 2x                    .256                        .4096                   0


As before, 1/2 is the dfwld in the last interval, so the code obtained is again 
111000001; of course! 
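
The table above can be reproduced mechanically. What follows is a minimal sketch in Python, not a definitive implementation: the names `narrow`, `dfwld_bits`, and `encode` are our own, exact rational arithmetic (`fractions.Fraction`) stands in for the hand computation, and the extreme case of Exercise 6.1.5 (a source word consisting only of the first letter) is not given any special treatment.

```python
from fractions import Fraction
from itertools import count
from math import ceil

# Relative frequencies of the running example (letters in the order a, b, c, d).
FREQS = {"a": Fraction(4, 10), "b": Fraction(3, 10),
         "c": Fraction(2, 10), "d": Fraction(1, 10)}
HALF = Fraction(1, 2)

def narrow(alpha, ell, letter, freqs):
    """Return (left endpoint, length) of the subinterval belonging to letter."""
    cum = Fraction(0)
    for s, f in freqs.items():
        if s == letter:
            return alpha + cum * ell, f * ell
        cum += f
    raise KeyError(letter)

def dfwld_bits(alpha, beta):
    """Bits, through the last 1, of the dfwld in [alpha, beta), 0 <= alpha < beta <= 1."""
    if alpha == 0:
        return "0"
    for t in count(1):
        x = ceil(alpha * 2**t)            # smallest x with x / 2^t >= alpha
        if Fraction(x, 2**t) < beta:
            return format(x, "0{}b".format(t))

def encode(word, freqs=FREQS):
    """dfwld encoding with rescaling one bit at a time."""
    alpha, ell, bits = Fraction(0), Fraction(1), ""
    for letter in word:
        alpha, ell = narrow(alpha, ell, letter, freqs)
        while True:
            if alpha + ell <= HALF:       # x -> 2x, shift out a 0
                bits, alpha, ell = bits + "0", 2 * alpha, 2 * ell
            elif alpha >= HALF:           # x -> 2x - 1, shift out a 1
                bits, alpha, ell = bits + "1", 2 * alpha - 1, 2 * ell
            else:                         # 1/2 is interior: no rescaling possible
                break
    return bits + dfwld_bits(alpha, alpha + ell)

print(encode("ccda"))   # 111000001
print(encode("bacb"))   # 1
```

Using exact fractions keeps the sketch faithful to the text, at the price of endpoints whose representations keep growing.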


The underflow expansion 


Suppose that 1/2 is in the interior of the current interval $[\alpha, \beta)$, in the course of
arithmetic encoding, so that rescaling is not possible. Suppose also that $[\alpha, \beta) \subseteq
[1/4, 3/4)$; so, we have $1/4 \le \alpha < 1/2 < \beta \le 3/4$.

Now, the eventually-to-be-discovered dfwld $r$ in $[\alpha, \beta)$, whose binary
expansion constitutes the rest of the code (added on to the code already generated)
is either in $[\alpha, 1/2)$ or in $[1/2, \beta)$. In the former case, $r = (.01\ldots)_2$ because
$\alpha \ge 1/4$; in the latter, $r = (.10\ldots)_2$ because $\beta \le 3/4$. The point is that the first
and second bits of the binary expansion of $r$ are different. Therefore, if you
know one, you know the other.

The transformation $x \to 2x - 1/2$ doubles the directed distance from $x$
to 1/2; call it the “doubling expansion around 1/2” if you like. Further, if
$r$ is the dfwld in the final interval to be discovered by subdividing $[\alpha, \beta) \subseteq
[1/4, 3/4)$, according to the source text, then $2r - 1/2$ will be the dfwld in the final
interval obtained by so subdividing $[2\alpha - 1/2, 2\beta - 1/2) \subseteq [0, 1)$. (Verify!)
Inspecting the effect of this transformation on $r \in [1/4, 3/4)$ we see: if $r =
(.01a_3a_4\ldots)_2$ then $2r - 1/2 = (.0a_3a_4\ldots)_2$, and if $r = (.10a_3a_4\ldots)_2$ then
$2r - 1/2 = (.1a_3a_4\ldots)_2$. That is, the effect of this transformation on $r \in [1/4, 3/4)$
is to delete the second bit of its binary expansion. But that bit is the opposite of
the first bit, which will be discovered the next time a rescaling occurs.

This leads to the following rules for using the underflow expansion. 


1. Keep track of the underflow count, the number of times that the underflow 
expansion has been applied since the last rescaling. 


2. When the current interval $[\alpha, \alpha + \ell)$ satisfies $1/4 \le \alpha < 1/2 < \alpha + \ell \le
3/4$, replace $\alpha$ by $2\alpha - 1/2$ and $\ell$ by $2\ell$, and add one to the underflow
count.




3. Upon rescaling, if the underflow count is $k$, add $01^k$ to the code stream if
the rescaling transformation is $x \to 2x$, and $10^k$ to the code stream if the
rescaling transformation is $x \to 2x - 1$; reset the underflow count to 0.
[$\varepsilon^k$ means the bit $\varepsilon$ iterated $k$ times. When $k = 0$, this means the empty string.]

Let’s try encoding babc, when $S = \{a,b,c,d\}$, in that order, with $f_a = .4$,
$f_b = .3$, $f_c = .2$, and $f_d = .1$, using rescaling and the underflow expansion.


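Next letter
or rescale                                            New       Underflow
or underflow        $\alpha$        $\ell$            code      count

(start)             0               1                           0
b                   .4              .3                          0
underflow           .3              .6                          1
a                   .3              .24                         1
underflow           .1              .48                         2
b                   .292            .144                        2
x -> 2x             .584            .288              011       0
x -> 2x - 1         .168            .576              1         0
c                   .5712           .1152                       0
x -> 2x - 1         .1424           .2304             1         0
x -> 2x             .2848           .4608             0         0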

The code is, therefore, 0111101, the last 1 arising from the dfwld 1/2 in the 
final interval, [.2848, .7456). 

Notice that if the rules for encoding with the underflow expansion are used
for encoding bacb, we wind up with a final interval containing 1/2, no code
generated, and an underflow count of 4. In this section, we ignore final
underflow counts and give the code as 1. In modified arithmetic encoding, we
may wind up with $10^4 = 10000$, possibly followed by some more bits of local
significance.

Notice that the underflow expansion prevents the current interval from get- 
ting arbitrarily small, in arithmetic encoding. But it does not prevent the end- 
points of the current interval from having increasingly lengthy representations, 
if we require exact computation. 
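
The three rules can be sketched in Python as follows. This is an illustration under stated assumptions rather than the algorithm of Section 6.4: the function name `encode_with_underflow` is ours, exact fractions are used throughout, the underflow expansion is not applied after the last source letter (an assumption chosen so that the sketch agrees with the final interval and the final underflow counts quoted above), and the final underflow count is ignored when the code is formed, as in this section.

```python
from fractions import Fraction

FREQS = {"a": Fraction(4, 10), "b": Fraction(3, 10),
         "c": Fraction(2, 10), "d": Fraction(1, 10)}

def encode_with_underflow(word, freqs=FREQS):
    """Return (code, final underflow count, final interval)."""
    half, quarter, three_q = Fraction(1, 2), Fraction(1, 4), Fraction(3, 4)
    alpha, ell, bits, under = Fraction(0), Fraction(1), "", 0
    for pos, letter in enumerate(word):
        # Narrow the current interval to the part belonging to this letter.
        cum = Fraction(0)
        for s, f in freqs.items():
            if s == letter:
                break
            cum += f
        alpha, ell = alpha + cum * ell, freqs[letter] * ell
        last = pos == len(word) - 1
        while True:
            beta = alpha + ell
            if beta <= half:                       # rule 3 with x -> 2x
                bits += "0" + "1" * under
                alpha, ell, under = 2 * alpha, 2 * ell, 0
            elif alpha >= half:                    # rule 3 with x -> 2x - 1
                bits += "1" + "0" * under
                alpha, ell, under = 2 * alpha - 1, 2 * ell, 0
            elif not last and quarter <= alpha and beta <= three_q:
                # rule 2: the underflow expansion x -> 2x - 1/2
                alpha, ell, under = 2 * alpha - half, 2 * ell, under + 1
            else:
                break
    # The final interval has 1/2 in its interior, so the dfwld is 1/2 and the
    # last bit is 1 (the all-first-letter case, alpha == 0, is not treated here).
    return bits + "1", under, (alpha, alpha + ell)

print(encode_with_underflow("babc")[0])    # 0111101
print(encode_with_underflow("bacb")[:2])   # ('1', 4)
```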




6.1.2 Decoding 


We assume that the decoder has been supplied the code word for the source
word and the length $N$ of the source word. If the code word is 0, then the
source word is $s_1 \cdots s_1 = s_1^N$. In all other cases, from the code word the decoder
knows the number $r$, the dfwld in the interval $A(i_1,\ldots,i_N)$ corresponding to
the sought-for source word.

As mentioned above, the decoder could, in principle, recover the source
word $s_{i_1} \cdots s_{i_N}$ from $r$ and $N$ by computing all the intervals $A(j_1,\ldots,j_N)$,
$1 \le j_1,\ldots,j_N \le m$, and deciding which of them contains $r$. A moment’s
thought shows that this would be senselessly inefficient. A much more
sensible approach is to find the index $i_1$ such that $r \in A(i_1)$, then the index $i_2$ such
that $r \in A(i_1, i_2)$, and so on, until $i_1,\ldots,i_N$ (or, equivalently, $s_{i_1},\ldots,s_{i_N}$) are
found. The process neatly exploits the fact that the intervals $A(i_1), A(i_1, i_2), \ldots$
are nested, i.e., each is a subinterval of the preceding.
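Once we have found $A(i_1,\ldots,i_k)$, we are looking for the unique $j \in \{1,
\ldots,m\}$ such that $r$ lies in the subinterval $A(i_1,\ldots,i_k,j)$ of $A(i_1,\ldots,i_k)$. This
$j$ will be $i_{k+1}$. How do we go about finding $i_{k+1}$? This is where we have to
refer to the method by which the interval $A(i_1,\ldots,i_k)$ is sliced up. Suppose
$A(i_1,\ldots,i_k) = [\alpha, \alpha + \ell)$. The endpoints of the intervals at the next level are $\alpha$,
$\alpha + f_1\ell$, $\alpha + (f_1 + f_2)\ell$, $\ldots$, $\alpha + (f_1 + \cdots + f_{m-1})\ell$, $\alpha + \ell$. Therefore, $i_{k+1}$ will
be the index satisfying

$$\alpha + \Big(\sum_{i < i_{k+1}} f_i\Big)\ell \;\le\; r \;<\; \alpha + \Big(\sum_{i \le i_{k+1}} f_i\Big)\ell$$

or, equivalently,

$$\sum_{i < i_{k+1}} f_i \;\le\; \frac{r - \alpha}{\ell} \;<\; \sum_{i \le i_{k+1}} f_i.$$

So, to sum up: having found $A(i_1,\ldots,i_k) = [\alpha, \alpha + \ell)$ containing $r$, find $i_{k+1}$,
the largest index among those $j$ such that $\sum_{i < j} f_i \le \frac{r - \alpha}{\ell}$. Then

$$A(i_1,\ldots,i_k,i_{k+1}) = \Big[\alpha + \Big(\sum_{i < i_{k+1}} f_i\Big)\ell,\; \alpha + \Big(\sum_{i \le i_{k+1}} f_i\Big)\ell\Big);$$

iterate the process until $i_1,\ldots,i_N$ have been found.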

Let’s try the procedure in the case $S = \{a,b,c,d\}$, $f_a = .4$, $f_b = .3$, $f_c = .2$,
and $f_d = .1$, where the decoder is given the code word 1 and $N = 4$.² As before,
we use the source letters themselves rather than indices. In the following table,
$\alpha$ denotes the left-hand endpoint of the current interval and $\ell$ denotes its length;
also, $r = 1/2 = (.1)_2$:


$\alpha$                      $\ell$                 $(r - \alpha)/\ell$      Next letter

0                             1                      .5                       b
.4                            .3                     1/3                      a
.4                            .12 = (.3)(.4)         5/6                      c
.484 = .4 + (.12)(.7)         .024 = (.12)(.2)       2/3                      b


and the process is complete since N = 4. 
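
The recovering-the-intervals method just carried out by hand can be sketched in a few lines of Python. The function name `decode` and the use of exact fractions are our own choices for illustration; the sketch assumes, as in the text, that the decoder knows the alphabet, the frequencies, and $N$, and it does no rescaling.

```python
from fractions import Fraction

FREQS = {"a": Fraction(4, 10), "b": Fraction(3, 10),
         "c": Fraction(2, 10), "d": Fraction(1, 10)}

def decode(code, n, freqs=FREQS):
    """Recover n source letters from the binary expansion of the dfwld r."""
    r = Fraction(int(code, 2), 2 ** len(code))   # code word read as (.code)_2
    alpha, ell, out = Fraction(0), Fraction(1), []
    for _ in range(n):
        q = (r - alpha) / ell                    # position of r in the current interval
        cum = Fraction(0)
        for s, f in freqs.items():               # first letter with q < cum + f, i.e.
            if q < cum + f:                      # the largest index with cum <= q
                break
            cum += f
        out.append(s)
        alpha, ell = alpha + cum * ell, f * ell  # the next nested interval
    return "".join(out)

print(decode("1", 4))           # bacb
print(decode("111000001", 4))   # ccda
```

The code word 0 is decoded as a run of the first letter, in agreement with the convention stated at the start of this subsection.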
Now, let’s try it with the code 111000001 and $N = 4$. We know the right
answer: ccda. And, this time, we will put into the mix the decoder’s version
of rescaling and the underflow expansion. The rescaling and underflow
transformations will be applied to $r$, of course. We start with $r = (.111000001)_2 =
449/512$.

² Incidentally, the decoder knows the source alphabet and the source frequencies.


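Next letter
or rescale
or underflow      $\alpha$      $\ell$       $r$                     $(r - \alpha)/\ell$

(start)           0             1            449/512 ≈ .877          ≈ .877
c                 .7            .2           449/512
x -> 2x - 1       .4            .4           193/256 ≈ .754          ≈ .885
c                 .68           .08          193/256
x -> 2x - 1       .36           .16          65/128 ≈ .508           ≈ .924
d                 .504          .016         65/128
x -> 2x - 1       .008          .032         1/64
x -> 2x           .016          .064         1/32
x -> 8x           .128          .512         1/4                     ≈ .238
a                 .128          .2048        1/4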


We apologize for cheating by combining three rescalings of the form $x \to 2x$
into one, $x \to 8x$, in the next to last line of the decoding table above.

There is a way to simplify the decoding process that combines rescaling 
with the discovery of the next source letter. This method superficially seems 
more efficient than the decoding method (or methods, if you count with and 
without rescaling as different) used so far, but it contains a fatal defect that 
limits its applicability. We will discuss this defect in Section 6.3. Meanwhile, 
for all its practical defects, this method may be of academic interest, so here it 
is. 


Given $r = r(i_1,\ldots,i_N)$ and $N$, we generate two sequences, $r_0, r_1, r_2, \ldots$
and $j_1, j_2, \ldots$ (or, equivalently, $s_{j_1}, s_{j_2}, \ldots$), as follows:

1. Set $r_0 = r$.

2. For $1 \le k \le N$, having found $j_1,\ldots,j_{k-1}$ and $r_{k-1}$, the decoder finds $j_k$,
the largest index such that $\sum_{j < j_k} f_j \le r_{k-1}$.

3. If $k = N$, the decoder is done. Otherwise, the decoder sets
$r_k = \frac{1}{f_{j_k}}\big(r_{k-1} - \sum_{j < j_k} f_j\big)$ and returns to 2, with $k$ replaced by $k + 1$.


You are asked to show, in Exercise 6.1.4, that the sequence $j_1,\ldots,j_N$ is actually
the sought-for sequence $i_1,\ldots,i_N$ such that $r = r(i_1,\ldots,i_N)$.

For example, let us decode the code words for bacb and for ccda that were 
found on page 144. As usual, we will use source letters instead of numeric 
indices. 

Code: 1; $r = r_0 = .5$; $N = 4$. Decoding table:




k     $r_k$                               next letter

0     .5                                  b
1     (1/.3)(.5 - .4) = 1/3               a
2     (1/.4)(1/3 - 0) = 5/6               c
3     (1/.2)(5/6 - .7) = 2/3              b


Thus we decode 1, with N = 4, as bacb. 


Code: 111000001; $r = r_0 = 449/512 = 0.876953125$; $N = 4$. Table:


k     $r_k$                                               next letter

0     449/512                                             c
1     (1/.2)(449/512 - .7) = 453/512 = .884765625         c
2     (1/.2)(453/512 - .7) = 473/512 = .923828125         d
3     (1/.1)(473/512 - .9) = 61/256 = .23828125           a


The source word decoded by this process is ccda, which is what it should have 
been. 

Notice that we do not have to find the intervals $A(i_1,\ldots,i_k)$, $1 \le k \le N$, in
this method. Calculating those mysterious $r_k$ does the job.
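
For readers who would like to experiment, here is a short Python sketch of steps 1 to 3. The name `decode_rk` is ours, and exact fractions are used so that the $r_k$ come out exactly as in the tables above; nothing here addresses the defect to be discussed in Section 6.3.

```python
from fractions import Fraction

FREQS = {"a": Fraction(4, 10), "b": Fraction(3, 10),
         "c": Fraction(2, 10), "d": Fraction(1, 10)}

def decode_rk(code, n, freqs=FREQS):
    """Return the decoded word and the list r_0, ..., r_{n-1}."""
    r = Fraction(int(code, 2), 2 ** len(code))   # step 1: r_0 = r
    letters, rs = [], []
    for _ in range(n):
        rs.append(r)
        cum = Fraction(0)
        for s, f in freqs.items():               # step 2: largest index whose
            if r < cum + f:                      # cumulative frequency is <= r
                break
            cum += f
        letters.append(s)
        r = (r - cum) / f                        # step 3: the next r_k
    return "".join(letters), rs

word, rs = decode_rk("111000001", 4)
print(word)                      # ccda
print([str(x) for x in rs])      # ['449/512', '453/512', '473/512', '61/256']
```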


Exercises 6.1 
1. Suppose that $S = \{a,b,c,d\}$ and $f_a = .35$, $f_b = .3$, $f_c = .25$, and $f_d = .1$.


(a) Encode bbbb, abcd, dcba, and badd by the method of this section, 
assuming that the decoder will be given the source word length. 

(b) Decode 11, 010001, 10101, and 0101, assuming the source word 
lengths are all 4. 


2. One of the disadvantages of the method of arithmetic coding described in 
this section is that the encoder must supply the decoder with the length N 
of the source word that has been encoded. Supplying information of a dif- 
ferent type outside the main code stream is usually extremely inconvenient. 


What if the encoder were to supply just the code and not $N$? Then the
decoder would know the dfwld $r$ in $A(i_1,\ldots,i_N)$, where $s_{i_1} \cdots s_{i_N}$ is the
source word encoded. The problem is that $r$ may also be the dfwld in
various intervals $A(i_1,\ldots,i_k)$, $k < N$; if $r$ is the dfwld in $A(i_1,\ldots,i_k)$,
then $r$ is the dfwld in any interval $A(i_1,\ldots,i_k,\ldots,i_t)$, $t > k$, that $r$ happens
to lie in, because the intervals are getting smaller and the denominator of $r$
is not getting any bigger.

Here is one way that the encoder could communicate the source word
length $N$ without special arrangements. Having found $A(i_1,\ldots,i_N)$ and
$r$, the encoder finds the smallest value of $k$ such that $r$ is the dfwld in
$A(i_1,\ldots,i_k)$, and then adds $N - k$ zeros to the code string. For example,
on page 144, bacb would be encoded 1000, while ccda would be
encoded 111000001, the same as before, because the dfwld in $A(ccd)$ is
$225/256 = (.11100001)_2$.


The decoder would proceed by the recovering-the-intervals method, with 
the additional burden of checking whether or not r is the dfwld in each 
successive interval. (Since rescaling amounts to shifting out the prefixes 
in which the interval endpoints agree, this checking is not too terrible; you 
apply rescaling to r until the current value of r is 1/2 or zero.) Once r is 
the dfwld, then the decoder knows from the number of zeros remaining in 
the code how long the source word is, and can proceed to decode by any 
effective method. 


Pretty clearly there are some awkwardnesses and inefficiencies to be dealt 
with in the implementation of such a coding method, and adding those zeros 
makes the code representatives of source words longer. Still, you might 
keep this in mind as a possible solution to the problem of specifying the 
source text length in arithmetic coding. Compare it with the trick employed 
in Section 6.4, for elegance. Now, some exercise! 


(a) In Exercise 1(a) above, how would the encoding be different if the 
encoder communicates the source word length (4) by adding zeros to 
the code words? 

(b) In the situation of Exercise 1, decode 011000, assuming that the source 
word length has been indicated by those three extra zeros, as described 
above. 


3. In the situation of Exercise 1 above, suppose that decoding has to proceed 
before encoding is finished, and the encoder, by rescaling, is supplying as 
much of the code string as possible to the decoder, as well as the length of
the source string read through so far. Suppose that the encoder informs the 
decoder that the first 3 bits of the code string are 010 and that seven source 
letters have been read (and no more bits of the code string are available 
beyond 010). What are those first seven source letters? (After attempting 
this problem, see Section 6.3.3.) 


4. Suppose that $S = \{s_1,\ldots,s_m\}$ is a source alphabet with relative
frequencies $f_1,\ldots,f_m > 0$, and $s_{i_1} \cdots s_{i_N}$ is a source word. For $1 \le k \le N$, let
$A(i_1,\ldots,i_k)$ have endpoints $\alpha_k$ and $\alpha_k + \ell_k$. (No rescaling in this problem!)
Let $r \in A(i_1,\ldots,i_N)$. (If you want, take $r$ to be the dfwld in $A(i_1,\ldots,i_N)$.)
Let $r_0, r_1, \ldots$ and $j_1, j_2, \ldots$ be the sequences described at the end of
Section 6.1.2; i.e., $r_0 = r$, $j_k$ is the largest index among $1,\ldots,m$ satisfying
$\sum_{j < j_k} f_j \le r_{k-1}$, and $r_k = \frac{1}{f_{j_k}}\big(r_{k-1} - \sum_{j < j_k} f_j\big)$, $k = 1, 2, \ldots$. Prove the
validity of the method that involves the $r_k$, i.e., that $i_k = j_k$, $k = 1, 2, \ldots$,
by showing that $r_k = \frac{r - \alpha_k}{\ell_k}$, $k = 1, 2, \ldots$. [Go by induction on $k$; note that
$r_0 = \frac{r - 0}{1} = r$. Next show that $i_1 = j_1$ and $r_1 = \frac{r - \alpha_1}{\ell_1}$. In the induction
step, assume that $i_{k-1} = j_{k-1}$, $r_{k-1} = \frac{r - \alpha_{k-1}}{\ell_{k-1}}$, and show that $i_k = j_k$ and
that $r_k = \frac{r - \alpha_k}{\ell_k}$.]

5. If the first few source letters are all $s_1$, what will the partial code word
provided by rescaling look like? Try it with $f_1 = .4$ and source text $W =
s_1 s_1 s_1 s_1 \cdots$. Now you know the only case, alluded to in Section 6.1.1, in
which the partial code word provided by rescaling is not necessarily an
initial segment of the final code word: it is the case in which the source text
$W$ is $s_1^N$.


6.2 What’s good about dfwld coding: the compression
ratio


Suppose that $A = [\alpha, \beta)$ is an interval of length $\ell < 1$ and that $t$ is an integer
satisfying $2^{-t} \le \ell < 2^{-(t-1)}$. The basis for method 1 on page 144 is the
observation that $A$ must contain some fraction of the form $x/2^t$, $x$ an integer; if not,
then $A$ would be contained in an interval of the form $\big(\frac{x-1}{2^t}, \frac{x}{2^t}\big)$ and would thus
have length $< 2^{-t}$.

Therefore, the binary expansion through the last 1 of the dfwld in $A$
(neglecting the possibility that the dfwld might be zero) is of length no greater than
$t$. On the other hand, $\ell < 2^{-(t-1)}$ implies $t < \log_2(1/\ell) + 1$. These observations
give rise to the following.


6.2.1 Theorem Suppose that $S = \{s_1,\ldots,s_m\}$ is a source alphabet, and the
source letters have relative frequencies $f_1,\ldots,f_m > 0$ in the source text. Then
the average length of the code words for the source words of length $N$, derived
by the arithmetic coding method of Section 6.1, is no greater than $NH(S) + 1$
where (as usual) $H(S) = -\sum_{j=1}^{m} f_j \log_2 f_j$.


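Proof: Let (as in Chapter 7) $f(i_1,\ldots,i_N)$ denote the relative frequency of
$s_{i_1} \cdots s_{i_N}$ among all source words of length $N$, for $1 \le i_1,\ldots,i_N \le m$. Let
$t(i_1,\ldots,i_N)$ denote the length of the arithmetic code word for $s_{i_1} \cdots s_{i_N}$; so the
average length of the code words for the source words of length $N$ is

$$\sum_{1 \le i_1,\ldots,i_N \le m} f(i_1,\ldots,i_N)\, t(i_1,\ldots,i_N).$$

By the observations preceding this theorem, and the fact that the interval
$A(i_1,\ldots,i_N)$ has length $\prod_{j=1}^{N} f_{i_j}$, we have

$$
\begin{aligned}
\sum_{1 \le i_1,\ldots,i_N \le m} f(i_1,\ldots,i_N)\, t(i_1,\ldots,i_N)
&\le \sum_{i_1,\ldots,i_N} f(i_1,\ldots,i_N) \Big(\log_2 \Big(\prod_{j=1}^{N} f_{i_j}\Big)^{-1} + 1\Big) \\
&= \sum_{i_1,\ldots,i_N} f(i_1,\ldots,i_N) \Big(\sum_{j=1}^{N} \log_2 f_{i_j}^{-1}\Big) + \sum_{i_1,\ldots,i_N} f(i_1,\ldots,i_N) \\
&= \sum_{i_1=1}^{m} \Big(\sum_{1 \le i_2,\ldots,i_N \le m} f(i_1,\ldots,i_N)\Big) \log_2 f_{i_1}^{-1} \\
&\quad + \sum_{i_2=1}^{m} \Big(\sum_{1 \le i_1,i_3,\ldots,i_N \le m} f(i_1,\ldots,i_N)\Big) \log_2 f_{i_2}^{-1} \\
&\quad + \cdots + \sum_{i_N=1}^{m} \Big(\sum_{1 \le i_1,\ldots,i_{N-1} \le m} f(i_1,\ldots,i_N)\Big) \log_2 f_{i_N}^{-1} + 1 \\
&= \sum_{i_1=1}^{m} f_{i_1} \log_2 f_{i_1}^{-1} + \sum_{i_2=1}^{m} f_{i_2} \log_2 f_{i_2}^{-1} + \cdots + \sum_{i_N=1}^{m} f_{i_N} \log_2 f_{i_N}^{-1} + 1 \\
&= NH(S) + 1. \qquad \Box
\end{aligned}
$$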


The reader should scrutinize and ponder the equation

$$\sum_{1 \le i_2,\ldots,i_N \le m} f(i_1,\ldots,i_N) = f_{i_1},$$

one of $N$ such used in the proof above. See Chapter 7.
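
A small numerical check of the bound may help fix ideas. The Python sketch below is an illustration only, under assumptions of our own: it uses the frequencies .4, .3, .2, .1 of the running example, it takes $f(i_1,\ldots,i_N)$ to be the product of the letter frequencies (a perfect zeroth-order source, which the theorem does not require), and the names `code_length`, `average_length`, and `entropy` are hypothetical. It enumerates every source word of length $N$, so it is only practical for small $N$.

```python
from fractions import Fraction
from itertools import count, product
from math import ceil, log2

FREQS = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]

def code_length(alpha, beta):
    """Number of bits in the code word given by the dfwld in [alpha, beta)."""
    if alpha == 0:
        return 1                                   # the code word is 0
    for t in count(1):
        if Fraction(ceil(alpha * 2**t), 2**t) < beta:
            return t

def average_length(n, freqs=FREQS):
    """Average code word length over all source words of length n."""
    cums = [sum(freqs[:i], Fraction(0)) for i in range(len(freqs))]
    total = Fraction(0)
    for word in product(range(len(freqs)), repeat=n):
        alpha, ell = Fraction(0), Fraction(1)
        for i in word:
            alpha, ell = alpha + cums[i] * ell, freqs[i] * ell
        total += ell * code_length(alpha, alpha + ell)   # weight: f(i_1,...,i_n)
    return total

def entropy(freqs=FREQS):
    return -sum(float(f) * log2(float(f)) for f in freqs)

for n in range(1, 5):
    print(n, float(average_length(n)), n * entropy() + 1)
```

In each printed line the middle number should not exceed the last, as the theorem asserts.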

Theorem 6.2.1 draws a conclusion about zeroth-order arithmetic encod- 
ing, and that conclusion is that the average number of bits per source letter 
achieved by pure dfwld encoding, as in Section 6.1, is vanishingly close to the 
Holy Grail, H(S) (widely believed, although not proven, except for replace- 
ment methods by fixed encoding schemes, to be an absolute lower bound on the 
number of bits per source letter achievable by uniquely decodable zeroth-order 
coding methods; see comment 4 below). We will see a corresponding theorem 
about higher-order arithmetic encoding in Section 7.3. We prefer not to attempt 
a rigorous definition of “zeroth-order,” but it might help the reader to under- 
stand the comments below if we mention the following. Zeroth-order statistical 
coding methods are distinguished by being based on the relative source let- 
ter frequencies alone, and not on more complicated statistical information like 
the relative frequencies of certain two-letter sequences, for instance. A perfect 
zeroth-order source is a source that emits source letters randomly and indepen- 
dently, with certain probabilities. Note that it is not assumed in Theorem 6.2.1 
that the source is a perfect zeroth-order source; it is the coding method that is 
zeroth-order.


Comments


1. Perhaps it should be emphasized that “the source,” envisaged by Shannon as 
a mysterious probabilistic finite state automaton (see Section 7.5), produces, or 
could produce, an infinite amount of “source text.” The usual situation is that 
we do not know the structure of the source and can only carry out statistical 
studies of finite samples of source text. The relative frequencies $f_1,\ldots,f_m$ are
really a priori probabilities that we would be able to calculate exactly if only we 
knew the structure of the source; usually we can only hope to approximate them 
by statistical study. Roughly speaking, the Law of Large Numbers says that our 
approximations will be good with high probability if our sample of the source 
text is large. 

In Theorem 6.2.1, $f_1,\ldots,f_m$ are assumed to be exact, the true a priori
probabilities of the various source letters being emitted by the source (at
randomly chosen instants). However, analysis similar to that contained in the proof
of Theorem 6.2.1 shows that if $f_1,\ldots,f_m$ are merely good approximations
to the true relative source frequencies $\hat{f}_1,\ldots,\hat{f}_m$, in the sense that the ratios
$\hat{f}_i/f_i$ are all close to 1, and if these approximate relative source frequencies
are used to subdivide intervals in zeroth-order arithmetic encoding of source
words of length $N$, then the average number of bits per code representative of
these source words will be no greater than $NH(S) + 1 + \epsilon$, where $\epsilon \to 0$ as
$\max_i |1 - (\hat{f}_i/f_i)| \to 0$, and $H(S) = -\sum_{i=1}^{m} \hat{f}_i \log_2 \hat{f}_i$ is the true zeroth-order
source entropy. Note also that $H(S)$ will be well approximated in this case by
$-\sum_{i=1}^{m} f_i \log_2 f_i$.

2. It may be useful to think of the source words of length $N$ that are
being encoded arithmetically as blocks of $N$ consecutive letters taken at
random from the infinite source text referred to above. Because we are considering
all possible source words of length $N$, the proof of Theorem 6.2.1 has to
resort to an averaging process that may possibly obscure the important point
that zeroth-order arithmetic coding with respect to the relative source
frequencies $f_1,\ldots,f_m$ will encode any source word $w$ in around $-\log_2 \ell(w)$ bits or
less, where, for $w = s_{i_1} \cdots s_{i_N}$, $\ell(w) = \prod_{j=1}^{N} f_{i_j}$ is the length of the “final
interval” $A(w)$. Notice that if $w$ is completely typical, and the source letters
occur in $w$ in exactly the proportions $f_1,\ldots,f_m$, then $\ell(w) = \prod_{i=1}^{m} f_i^{N f_i}$, so
$-\log_2 \ell(w) = -N \sum_{i=1}^{m} f_i \log_2 f_i = NH(S)$. If you don’t mind a certain amount
of fudging, you could say that this observation establishes that zeroth-order
arithmetic coding of text from a source $S$ encodes “typical” source text in
around $H(S)$ bits per source letter.
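
As a quick illustration of the quantity $-\log_2 \ell(w)$, here is a hedged Python sketch with the frequencies of the running example. The function names `information` and `entropy` are our own, and ordinary floating point is used since only approximate values are wanted. The word `aaaabbbccd` is our choice of a “completely typical” word of length 10.

```python
from math import log2

FREQS = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

def information(word, freqs=FREQS):
    """-log2 of the length of the final interval A(word)."""
    return -sum(log2(freqs[s]) for s in word)

def entropy(freqs=FREQS):
    return -sum(f * log2(f) for f in freqs.values())

# ccda has a final interval of length (.2)(.2)(.1)(.4) = .0016; its dfwld
# code word 111000001, found in Section 6.1, has 9 bits.
print(information("ccda"))

# A word containing the letters in exactly the proportions .4, .3, .2, .1:
print(information("aaaabbbccd"), 10 * entropy())
```

The last two printed numbers agree up to rounding, which is the identity $-\log_2 \ell(w) = NH(S)$ for a completely typical word.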

Bell, Cleary, Neal, and Witten [8, 84] call $-\log_2 \ell(w)$ the entropy of the
message w, and they and Langdon and Rissanen [43,44] interpret it as the num- 
ber of bits that “ought” to be allocated to the encoding of w, if the given relative 
source frequencies are correct. 


3. It may be worth emphasizing again that in Theorem 6.2.1 there are no
assumptions about the nature of the source, beyond the existence of the relative
source letter frequencies $f_1,\ldots,f_m$ and $f(i_1,\ldots,i_N)$, $1 \le i_1,\ldots,i_N \le m$.
In particular, the source is not assumed to be a perfect zeroth-order source, 
emitting the source letters randomly and independently. The information given 
about the source (i.e., the relative source letter frequencies) is zeroth-order, but 
the source itself need not be. 


4. Restated in statistical shopkeeper’s terms, Theorem 6.2.1 says that dfwld 
arithmetic coding uses, on average, no more than H(S)+1/N bits per source 
letter, in the encoding of source words of length N. Now as has been men- 
tioned from time to time in the information theory part of this book (see Chap- 
ter 2, the discussion following Theorem 4.3.7, and Section 4.6), the source 
entropy H(S) bears the interpretation of being the average amount of infor- 
mation carried by an individual source letter in the source text, and the use of 
the base 2 in the logarithm involved in H(S) means that information is being 
measured in bits. That is, the average source letter in the source text carries 
$H(S) = \sum_{j=1}^{m} f_j \log_2(1/f_j)$ bits of information.

Therefore, it should be impossible for any lossless (zeroth-order statistical) 
method of encoding to encode source text using, on the average, less than H (S) 
bits per source letter. The content of Theorem 6.2.1 is the often repeated mantra 
that arithmetic coding is optimal among lossless zeroth-order statistical coding 
methods, that it takes you as close to the shrieking limit of compression the- 
oretically achievable by such methods as you could wish. (In fact, the words 
“zeroth-order” often are omitted from this mantra.) 

Let us digress briefly into controversy concerning those words “should be 
impossible” in the paragraph above. It is a widely held belief that “should be” 
can be confidently replaced by “is” in that statement (although, as noted, it 
would be reckless to omit “zeroth-order statistical’). This belief arises from 
faith in Shannon’s quantification of information, which treats information as 
an incompressible fluid; you can compress text by squeezing out redundancy 
and unused space between nuggets of information, but you cannot (so the belief 
says) put a certain amount of information into a container (a code representa- 
tion) that is too small for that amount of information, without losing informa- 
tion. Try pouring 12 ounces of your favorite liquid into a glass that holds only 
10 ounces, and the intuitive idea becomes clear. 

We are believers, also, but would like to point out that, so far as we know, 
it has never been rigorously proven that there cannot be a source S and a bi- 
nary zeroth-order lossless statistical encoding method which encodes text from 
this source in fewer than H(S) bits per source letter. For one thing, there is 
the problem of saying exactly what we mean by a zeroth-order statistical en- 
coding method. Notice that the Noiseless Source Coding Theorem (5.4.3 and 
also 7.2.1) establishes the result for a particular class of such methods, namely, 
replacement via encoding scheme. Bell, Cleary, and Witten, in their canonical 
classic on lossless methods [8], attempt (p. 47) a proof of the Noiseless Source 
Coding Theorem in apparently greater generality, but their proof may suffer 
from some difficulties, located near the beginning of the attempt. In fact,
Exercise 6.2.1 at the end of this section shows that what they seem to aim to prove
there cannot be proven because it is not true—although this assertion can be de- 
bated on the grounds that some fudging is possible in the understanding, in Bell, 
Cleary, and Witten’s proof, of what it means for a “set of codes” to represent 
“messages” (i.e., source words), and on the admission below of a hidden cost 
that is left out of account in Exercise 6.2.1. Perhaps a more devastating objec- 
tion to their attempted proof, and a test for all such attempted proofs, is provided 
by the observation that it is quite normal for higher order Huffman encoding to 
encode in significantly fewer bits per source letter than H (S) (see Section 7.1). 
Therefore, a proof based on assumptions about the encoding method that does 
not rule out higher order Huffman encoding must be fallacious. 

Without giving away any answers, we note that it appears that Exercise 
6.2.1 suggests that it is possible, in a certain extreme case, to encode source 
words of length N by dfwld arithmetic coding at an average cost per source 
letter of slightly less (where “slightly less” is a function of N) than H(S) bits 
per source letter. This is enough to shoot down certain rash statements about the 
entropy being an absolute lower bound on the “average code word length,” or at 
least to make us skeptical of such statements; but the result of the exercise, and 
of the preceding theorem, leave out of account a certain hidden cost of arith- 
metic coding as described in Section 6.1, namely, the necessity of supplying to 
the decoder the length $N$ of the source word. This will require $1 + \lfloor \log_2 N \rfloor$ bits,
and will therefore increase the average number of bits per source letter by about
$(1 + \log_2 N)/N$. When you throw this into the average computed in Exercise
6.2.1, “slightly less” than $H(S)$ becomes “slightly more.”
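
To get a feeling for the size of this hidden cost, the following lines of Python compute it for a few values of $N$. This is a sketch under the assumption that $N$ is sent as a plain binary numeral of $1 + \lfloor \log_2 N \rfloor$ bits; the function names are ours, and how the decoder would find the end of that numeral is left out of account.

```python
def bits_for_length(n):
    """1 + floor(log2 n): the number of binary digits of the positive integer n."""
    return n.bit_length()

def per_letter_overhead(n):
    """Extra bits per source letter incurred by passing n to the decoder."""
    return bits_for_length(n) / n

for n in (4, 100, 1000, 10**6):
    print(n, bits_for_length(n), per_letter_overhead(n))
```

The overhead per source letter shrinks as $N$ grows, but for every $N$ it is at least $1/N$.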

In Section 6.4 we will consider an algorithm for arithmetic coding that sim- 
ulates pure dfwld encoding without passing the length N of the source word en- 
coded. However, this algorithm operates with a new cost, an extraneous source 
symbol EOF for “end of file,” to be used once at the end of the source text.
This symbol has to be assigned a relative frequency, and the other source letters 
have to have their relative frequencies trimmed. We haven’t done the arithmetic, 
but surely this new device costs enough to keep the average number of bits per 
source letter above the entropy. We are Shannonite believers, you see, but we 
like to keep track of what has been proven and what has not. 


Exercises 6.2 


1. Suppose that $m = 2^L$ for some positive integer $L$, and that $f_j = 2^{-L}$, $j =
1,\ldots,m$. Note that, by Theorem 5.4.3, since the $f_j$ are integral powers of
1/2, and since Huffman’s algorithm always gives the smallest $\bar{\ell}$ that can
be achieved with a prefix code, in this case the $\bar{\ell}$ produced by Huffman’s
algorithm will be $H$, the source entropy.


(a) Compute H(S). If the source letters $s_1,\dots,s_{2^L}$ are really the binary words
of length L, what is the compression ratio achieved by Huffman’s algorithm
in this case?

(b) Suppose that the source is a perfect zeroth-order source, meaning
that $f(i_1,\dots,i_N) = \prod_{j=1}^{N} f_{i_j} = 2^{-NL}$ for all positive integers N and
$1 \le i_1,\dots,i_N \le m$, with $f(i_1,\dots,i_N)$ denoting, as in the text, the relative
frequency of the source word $s_{i_1} \cdots s_{i_N}$ among all source words of length
N. Recall that $H(S^N) = NH(S)$ in this case.


Therefore, by (a), this assumption implies that encoding $S^N$ by Huffman’s
algorithm will result in no compression at all. The average code word
length achieved by applying Huffman’s algorithm to $S^N$ will be NL; thus
the average number of bits per original source letter in the encoding will be
L.


Show that the average length of a code word representing a source word
of length N, derived by the arithmetic coding method of Section 6.1, is
$NL - 1 + 2^{-(NL-1)}$. (Get started by noting that the dfwld in $A(i_1,\dots,i_N)$
is the left-hand endpoint. These left-hand endpoints are $j/2^{NL}$, $0 \le j \le 2^{NL} - 1$.
You will probably need to know that $\sum_{j=1}^{k} j\,2^{j-1} = (k-1)2^k + 1$,
k = 1, 2, .... You can prove this by induction; or, differentiate both sides
of $\sum_{j=0}^{k} x^j = \frac{x^{k+1} - 1}{x - 1}$ and plug in x = 2.)


Thus, arithmetic coding appears to beat Huffman in this case. Not by much, 
but then it is a severely intractable case. However, the appearance is decep- 
tive because, as noted in the remarks at the end of this section, you have to 
pass the length N of the source text to the decoder along with the code text, 
and that adds around $\log_2 N$ bits to the total code package; this is not much
compared with NL, but it is much bigger than $1 - 2^{-(NL-1)}$.


2. Suppose that $s_1,\dots,s_m$ are binary words with the SPP and, in the class
of files to be parsed by the $s_j$, the relative frequencies of $s_1,\dots,s_m$ in
the resulting source text are $f_1,\dots,f_m$, respectively. Suppose that
$L = \sum_{j=1}^{m} f_j\,\mathrm{lgth}(s_j)$. Show that the typical compression ratio achieved in
arithmetically encoding original files that translate into source words of
length N is no less than $L/[H(S) + N^{-1}(2 + \log_2 N)]$.


6.3 What’s bad about dfwld coding and some ways to fix it


As promised, we look at some of the impractical features of pure dfwld arithmetic
coding and make some suggestions. In Section 6.4 all of these impracticalities
will be overcome, with the sacrifice of a certain purity.


6.3.1 Supplying the source word length 


In the method of Section 6.1, the encoder must supply the decoder with the 
length N of the source text, but how is this to be done? If the encoder sends the 
binary representation of N at the beginning of the code stream, how is the de- 
coder to know when that representation is finished and the code proper begins? 
Some possibilities: 

1. If it is known that there will never be more than M source letters in
any source text to be dealt with, then you can reserve $\lfloor \log_2 M \rfloor + 1$ bits at the
beginning of the code text for transmitting N. If no bound on the source text
length is known, you can still use this device by choosing M reasonably large
and reserving $\lfloor \log_2 M \rfloor + 2$ bits at the beginning of the code text; the last bit of
these is a “warning bit” which, if set at 1, warns the decoder that the expression
for N has overflowed the allotted space and will continue into the next block
of $\lfloor \log_2 M \rfloor + 2$ bits equipped with its warning bit; and so on. The code finally
commences when the last warning bit is zero.

The disadvantage of this solution to the problem of supplying N is that it
compounds the problem discussed in 6.3.3, below. Presumably the encoder will 
keep a count of the source letters while encoding. If the encoder is to convey 
the number N of source letters at the beginning of the code text, then there 
will be a great delay; the decoder will not even get a peek at the partial code 
word supplied by rescaling until the encoding is complete. This is all right in 
those leisurely situations in which the encoded, compressed text is to be stored 
away and decompressed later, but the other kind of situation is encountered with 
increasing frequency. 

We could convey N or a running count of the source letters encoded in some 
other location outside the main code stream. However, providing companion 
locations or parallel streams is inconvenient precisely in those situations when 
we are in a hurry and hope to decode on the heels of encoding. We will have 
more to say about this in 6.3.3. 

2. The method of Exercise 6.1.2 smoothly communicates the source word 
length by adding a certain number of zeros onto the code word. The method 
imposes an extra burden of computation on both the encoder and the decoder, 
but this disadvantage is not as important as the fact that this method appears 
to be incompatible with solutions to the problem addressed in 6.3.3; how can 
decoding proceed on the heels of encoding if the decoder does not know whether 
a string of zeros in the code stream is part of the regular code word or part of the 
extra zeros at the end? This problem could be dealt with by providing a marker 
of, or a pointer to, the end of the regular code word, but this would again raise 
the technical difficulty of supplying an extra location or stream of information 
outside the main code stream. 

3. The algorithm of Section 6.4, yet to come, eliminates the necessity of 
counting the source letters, at the cost of introducing an extra source letter, 
usually called EOF, for “end of file.” This extra letter will be used once, to
mark the end of the source text.

The disadvantage in this trick is that compression will be somewhat less 
than optimal, not so much because of the extra bits required for EOF as because 
the original relative source frequencies will have to be trimmed a bit to make 
room for the small relative frequency to be assigned to EOF. However, the 
entropy of the modified source can be made as close as desired to the original 
entropy by making $f_{EOF}$ sufficiently small (because $x \log_2 x$ is a continuous
function of x > 0 and $x \log_2 x \to 0$ as $x \downarrow 0$). Therefore, by analysis similar
to that in the proof of Theorem 6.2.1 and the remarks following, you can get 
within a cat’s whisker of optimal lossless compression, on the average, for long 
source texts, using EOF and the algorithm of Section 6.4. 
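
Returning to possibility 1 above, the warning-bit device can be sketched in a few lines. This is only an illustration under assumptions of ours: the function names `encode_length` and `decode_length` are hypothetical, the binary expression for N is padded with leading zeros to fill whole blocks, and the blocks are sent most significant first (the text does not fix these details).

```python
def encode_length(n, M):
    """Header for the source length n: blocks of floor(log2 M) + 1 payload bits,
    each followed by a warning bit (1 = another block follows, 0 = last block)."""
    payload = M.bit_length()                 # equals floor(log2 M) + 1 for M >= 1
    digits = format(n, "b")
    n_blocks = -(-len(digits) // payload)    # ceiling division
    digits = digits.zfill(n_blocks * payload)
    header = []
    for i in range(n_blocks):
        block = digits[i * payload:(i + 1) * payload]
        warning = "1" if i < n_blocks - 1 else "0"
        header.append(block + warning)
    return "".join(header)


def decode_length(bits, M):
    """Read the header from the front of `bits`; return (n, bits consumed)."""
    payload = M.bit_length()
    pos, digits = 0, ""
    while True:
        digits += bits[pos:pos + payload]
        warning = bits[pos + payload]
        pos += payload + 1
        if warning == "0":
            return int(digits, 2), pos
```

With M = 255 each block is 9 bits long; a short text costs one block, and the code proper begins at the position returned by `decode_length`.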


6.3.2 Computation 


The arithmetic coding method of Section 6.1 requires exact computations, both 
in encoding and decoding. These are costly, especially the multiplications. The 
length of $A(i_1,\dots,i_N)$ is $\prod_{j=1}^{N} f_{i_j}$, which on the average requires around N
times as many bits to store (never mind compute) as the average number
of bits required to store one of the (rational) numbers $f_1,\dots,f_m$.

Rescaling and the underflow expansion may appear to relieve the burden of 
exact computation. However, note that these operations involve multiplying the 
interval lengths by powers of two. Therefore, odd factors of the denominators 
of the $f_i$ are never reduced by rescaling, and, if bigger than one, will cause
the complexity of and storage space required for exact computations to grow 
inexorably, approximately linearly with the number of source letters. This ob- 
servation inspires the first of three suggestions for lessening the burden of exact 
computation. 


Replace the $f_i$ by approximations which are dyadic fractions. For example,
if m = 4 and $f_1 = .4$, $f_2 = .3$, $f_3 = .2$, and $f_4 = .1$, you could take $f_1 = 102/256$,
$f_2 = 77/256$, $f_3 = 51/256$, and $f_4 = 26/256$, these being the closest (by most definitions of
closeness) dyadic fractions with common denominator 256 to the actual values
of $f_1$, $f_2$, $f_3$, and $f_4$. It may appear that replacing the $f_i$ in this case by these
approximations will actually increase the burden of computation, because the
approximations are nastier-looking fractions than the original $f_i$, and this is
indeed a consideration; we could make life easier if we replaced the $f_i$ by dyadic
fraction approximations with denominator 8 or 16, but then our approximations
would not be very close to the true relative frequencies, and that might
affect compression deleteriously. [It can be shown, by analysis similar to that
in the proof of Theorem 6.2.1, that as approximate relative frequencies tend to
the true relative frequencies, the average number of bits per source letter in code
resulting from arithmetic coding of source words of length N, using the
approximate relative frequencies, will eventually be bounded above by $H(S) + \frac{1}{N} + \epsilon$,
for any $\epsilon > 0$, where H(S) is the true source entropy. Thus good approximations
ensure good approaches to optimal lossless encoding; however, not much
is known about the penalty to be paid for bad approximations.]
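
One way to produce such approximations mechanically is sketched below. The function name `dyadic_approx` is hypothetical, and the largest-remainder rule it uses (round everything down, then hand the missing units to the largest fractional parts) is our choice of one reasonable notion of "closest"; the text does not prescribe a method. The frequencies are assumed to be exact rationals adding up to 1.

```python
from fractions import Fraction


def dyadic_approx(freqs, k):
    """Return numerators n_i with sum 2**k such that n_i / 2**k approximates freqs[i]."""
    denom = 1 << k
    scaled = [Fraction(f) * denom for f in freqs]
    nums = [x.numerator // x.denominator for x in scaled]   # round down
    shortfall = denom - sum(nums)
    # Indices ordered by fractional part, largest first.
    order = sorted(range(len(freqs)),
                   key=lambda i: scaled[i] - nums[i],
                   reverse=True)
    for i in order[:shortfall]:
        nums[i] += 1
    return nums


freqs = [Fraction(2, 5), Fraction(3, 10), Fraction(1, 5), Fraction(1, 10)]
print(dyadic_approx(freqs, 8))   # numerators over 256
print(dyadic_approx(freqs, 4))   # numerators over 16, a much coarser fit
```

Note that with a small denominator a very rare letter can be rounded to numerator 0, which would make it unencodable; a practical version would have to guard against that.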


Encode blocks of source letters of a certain fixed length. After each block 
is encoded, the encoder starts over on the next block. The computational ad- 
vantage is that the denominators of the rational numbers that give the interval 
endpoints and lengths cannot grow without bound, even if the relative source 
frequencies are not dyadic fractions, since the calculations start over
periodically.³

But how will the decoder know where the code for one block ends and the 
code for the next block begins? If an efficient method of providing non-binary 
markers or a parallel pointer/counter stream outside the main code stream is 
ever devised, this might be a good place to use it. In the absence of any such 
technical convenience, we could use a modification of the method of Section 
6.4, with the artificial source letter EOB for “end of block” to be inserted by 
the encoder into the source text at the end of each block. Of course, this device 
costs something in diminished compression. The longer the blocks, the less the 
cost of EOB, but the greater the cost of computation. 


Use approximate arithmetic. This third suggestion for avoiding computa- 
tional arthritis in arithmetic coding is the method actually used in the proposed 
implementation in Section 6.4. The interval [0, 1) is replaced by an “interval” of 
consecutive integers, {0,..., M - 1}, which we will continue to denote [0, M),
and in subdivisions of this interval the source words are allocated blocks of
consecutive integers approximately as they would be allocated in pure dfwld
arithmetic coding using the full interval of real numbers from 0 to M. Thus, if
$f_1 = .4$, $f_2 = .3$, $f_3 = .2$, $f_4 = .1$, and M = 16, then $A(1) = A(s_1)$ = {0, 1, 2, 3,
4, 5} = [0, 6), $A(2) = A(s_2)$ = {6, 7, 8, 9, 10} = [6, 11), etc. Are you worried
that subsequent subdivision will shrink the intervals to lengths less than one so 
that they may fail to contain any integers at all? Well may you worry! This un- 
pleasant possibility is taken care of by starting with M sufficiently large, with 
respect to the relative source frequencies, and by rescaling and applying the 
underflow expansion. 
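
The integer subdivision just described can be sketched as follows. The function name `subdivide` is hypothetical; the relative frequencies are passed as integer weights (4, 3, 2, 1 standing for .4, .3, .2, .1) so that the round-down is exact, and the round-down convention itself is the one used in Section 6.4.

```python
def subdivide(L, H, weights):
    """Split the integer interval [L, H) into consecutive blocks, one per letter.

    weights[i] / sum(weights) plays the role of f_i; endpoints are rounded down.
    """
    total = sum(weights)
    width = H - L
    blocks = []
    cum = 0
    for w in weights:
        lo = L + cum * width // total
        cum += w
        hi = L + cum * width // total
        blocks.append((lo, hi))
    return blocks


print(subdivide(0, 16, [4, 3, 2, 1]))   # the example in the text: [0,6), [6,11), ...
print(subdivide(0, 3, [4, 3, 2, 1]))    # too short: one block comes out empty
```

The second call shows the worry raised above: on an interval of length 3 the third letter receives the empty block [2, 2).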

This trick solves the problem of exact computation by simply doing away 
with exact computation. The disadvantage lies in the level of compression 
achievable. This disadvantage has been considered in [8, 33, 48], but there is 
room for further analysis. Some experimental results comparing pure dfwld
arithmetic encoding involving exact computation with slapdash integer-interval
methods of the type in Section 6.4 appear in [51].


³ You might observe that block arithmetic encoding amounts to using an encoding scheme for
$S^N$, where N is the length of the source blocks. This encoding scheme definitely does not satisfy
the prefix condition; for instance, the single digit 1 is the code word representative of some member
of $S^N$, and is also the first digit of the code representatives of a great many others.

However, the luxury of instantaneous decoding available with a prefix-condition encoding
scheme for $S^N$ is illusory. If |S| is fairly large, say |S| = 256, and N is fairly hefty, say N = 10,
then an encoding scheme for $S^N$ would have a huge number, $|S^N| = |S|^N$, of lines; $256^{10} = 2^{80}$
is an unmanageable number of registers necessary to store an encoding scheme. So a nice
prefix-condition scheme for $S^N$ is of no practical value in any case. You can think of the decoding process
in block arithmetic coding as a clever and relatively efficient way of looking up code words in an
encoding scheme without having actually to store the scheme.

Note that if M is a multiple, or, better yet, a power, of the common
denominator of $f_1,\dots,f_m$, then interval subdivision of the blocks of integers
substituting for real intervals in this form of arithmetic coding is sometimes exact. In
practice, M is usually taken to be a power of 2, $M = 2^K$. It might be a shrewd
move in these cases to replace the $f_i$ by dyadic fraction approximations with
common denominator $2^k$, with k being an integer divisor of K. (Of course, it
is a luxury to know the $f_i$ beforehand. In adaptive arithmetic coding, to be
described in Chapter 8, the $f_i$ are changing as the source is processed, and it is
not convenient to repeatedly replace them by approximations.)


6.3.3 Must decoding wait until encoding is completed? 


According to the description of dfwld arithmetic encoding in Section 6.1, in or- 
der that a source text be encoded and the code word for it subsequently decoded, 
the encoder has to read through the entire source text and compute the dfwld r 
associated with the source text, and the decoder has to wait until encoding is 
completed before beginning to decode. 

Compare this train of events with the corresponding operation in the case of 
replacement encoding, in which each occurrence of a source letter is replaced by 
a binary word. In all forms of such encoding (including the adaptive varieties, 
which will be described in Chapter 8), encoding can begin as soon as scanning 
of the source text begins, and decoding can begin as soon as the beginnings of 
the code text are supplied to the decoder. Clearly there are situations in which 
it is highly desirable, or even indispensable, that encoding not wait upon the 
reading of the entire source text, nor decoding upon the delivery of the entire 
code text. 

Can this apparent disadvantage of dfwld arithmetic coding be overcome? 
Well, yes; for instance, one could limit the delays by resorting to encoding of 
blocks of source letters of a pre-set length, as discussed above, at the cost of 
providing a parallel counter or pointer stream, or an extra source letter, EOB, 
which would be discarded by the decoder. 

Is there any way to take advantage of the partial code words, prefixes of the 
final code word, supplied to the decoder by the encoder through rescaling? Yes: 
as noted at the end of Section 6.1.1, the decoder can deduce what the source 
text is right up to and including the last letter processed by the encoder if the 
decoder is supplied with the code text shifted out by rescaling together with the 
number of source letters processed so far. In Exercise 6.1.3 you were asked to 
struggle with the deduction process. Let us look now at what the decoder has to 
do. 

Suppose that the code text supplied by rescaling after the scanning of the 
prefix W of the source text, say with N letters, is a binary word u. Recall 
from the introductory discussion of rescaling that u consists of the part of the
initial segments of the binary expansions of the endpoints of the interval A(W)
where those binary expansions agree. [In case $A(W) = [1 - \epsilon, 1)$, this statement
holds true if you take $1 = (.11\cdots)_2$.] That is, the binary expansion of the lower
endpoint of A(W) looks like $(.u0\cdots)_2$, and the binary expansion of the upper
endpoint is $(.u1\cdots)_2$. Therefore, provided no interval endpoints are dyadic
fractions, $(.u1)_2$ is the dfwld in A(W)! That is, given u and N, the decoder
needs only to tack 1 on the end of u and decode normally.

Thus, in Exercise 6.1.3, in which the relative source frequencies make it 
unlikely that any interval endpoint other than 0 or 1 is a dyadic fraction, you
take 010, tack on a 1 to get 0101, and decode normally by any of the methods of
Section 6.1.1. [Note $r = (.0101)_2 = 5/16$ and N = 7.] You should get acdcaca.
You can check that this is correct by encoding acdcaca, with rescaling, to see 
if 010 is the partial code word provided by rescaling. 

The problem of dyadic fraction interval endpoints is a nuisance, but can 
be overcome. As mentioned in the footnote on page 145, this would not be a 
problem if we made our intervals open on the left, closed on the right, and that is 
one way out. Even with intervals closed on the left, note that if $(.u1)_2$ is not the
dfwld in A(W), then either $(.u1)_2$ is in A(W), in which case normal decoding
of u1 gives the source word W, or $(.u1)_2$ is the upper endpoint of A(W). This
second possibility can be checked for by the decoder, and adjustments can be 
made. 

Although we will not dwell upon it here, this process of decoding from 
knowing N and the partial code supplied by rescaling can be adapted so that it
proceeds right “on the heels” of encoding, with the decoder’s rescaling sweep- 
ing away old code and keeping the eager decoder one source letter behind the 
encoder. 

The great impediment to our happiness with this method of eager decoding 
is the necessity of supplying the source letter count N corresponding to the par- 
tial code supplied by rescaling. If N is to be conveyed by some pointer/counter 
stream parallel to the regular code stream, compression is seriously reduced. 
Perhaps there are situations in which code can be delivered to the decoder in 
conformity with a certain rhythm, so that the decoder gets N by some sort of 
timing device; barring some such trick, this sort of decoding on the heels of 
encoding appears infeasible. 

Another, more promising, path to allowing decoding to follow soon upon 
the start of encoding arises from the observation that in the decoding of Section 
6.1, we need only decide in which of several large intervals $(r - \alpha)/\ell$ lies. We
usually do not need to know $(r - \alpha)/\ell$ exactly, which means that we usually do
not need to know r exactly. 

What would happen if we replaced r by the approximation $\tilde{r}$ obtained by
truncating the binary expansion of r somewhere, i.e., if we tried to proceed
using just the (current) first few bits of the (current) code stream? The
approximation $\tilde{r}$ will be a little less than r, so $(\tilde{r} - \alpha)/\ell$ will be a little less than
$(r - \alpha)/\ell$. If the latter is exactly equal to the lower endpoint of the interval
$A(s_k) = [\sum_{j<k} f_j, \sum_{j \le k} f_j)$ in which it lies (so that we ought to decode $s_k$)
then we are doomed: $(\tilde{r} - \alpha)/\ell$ will be in the next interval down, we will
wrongly decode $s_{k-1}$, the next “current interval” will be wrong, and we will
be in a world of trouble. The same catastrophe will occur if $(r - \alpha)/\ell$ is not
equal to the lower endpoint of $A(s_k)$, but is very close to it, and $\tilde{r}$ is not close
enough to r to put $(\tilde{r} - \alpha)/\ell$ in $A(s_k)$.
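
A small numerical illustration of the catastrophe, using the frequencies .4, .3, .2, .1 of this section's examples at the very first decoding step (so the current interval is [0, 1), with lower endpoint 0 and length 1). The value r = 13/32 and the helper names `first_letter` and `truncate` are hypothetical choices of ours, picked so that r sits just above the lower endpoint .4 of the second interval.

```python
import math
from fractions import Fraction

LETTERS = "abcd"
# Cumulative frequencies 0, .4, .7, .9, 1 as exact rationals.
CUM = [Fraction(0), Fraction(2, 5), Fraction(7, 10), Fraction(9, 10), Fraction(1)]


def first_letter(x):
    """Return the letter s_k whose interval [CUM[k], CUM[k+1]) contains x."""
    for k, s in enumerate(LETTERS):
        if CUM[k] <= x < CUM[k + 1]:
            return s
    raise ValueError("x is outside [0, 1)")


def truncate(r, bits):
    """Keep only the first `bits` binary digits of r."""
    scale = 1 << bits
    return Fraction(math.floor(r * scale), scale)


r = Fraction(13, 32)                      # (.01101)_2, just above .4
print(first_letter(r))                    # decoding with the exact r
print(first_letter(truncate(r, 4)))       # decoding with (.0110)_2 = .375
```

Dropping a single bit moves the value from the interval of b into the interval of a, and every later step would then be computed from the wrong current interval.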

These catastrophes could be avoided at some cost in compression if we
roughened the arithmetic coding process by putting some space, an “error
zone,” between the intervals into which the current interval is subdivided. That
is, the initial intervals $A(s_1),\dots,A(s_m)$ would not cover [0, 1), and subsequent
subdivisions would be similar to the first. Then we can proceed to decode
fearlessly, replacing r by $\tilde{r}$, provided we have figured out how far to take the binary
expansion of r, to obtain $\tilde{r}$, so as to ensure that whenever $(r - \alpha)/\ell$ is in $A(s_k)$,
$1 < k \le m$, then $(\tilde{r} - \alpha)/\ell$ will be greater than the upper endpoint of $A(s_{k-1})$.

This roughening is somehow, happily, built into the algorithm of Section 
6.4, but not explicitly. The algorithm is a discrete simulation of pure dfwld 
arithmetic coding which corrects all the defects of the pure process that we 
have discussed here, at a controllable cost. 


Exercises 6.3 


1. Suppose that S = {a,b,c,d}, and $f_a = .35$, $f_b = .3$, $f_c = .25$, and $f_d = .1$,
as in Exercise 6.1.1. Find the dyadic fractions $\hat{f}_a$, $\hat{f}_b$, $\hat{f}_c$, and $\hat{f}_d$ with
common denominator 16, adding up to 1, such that $(\hat{f}_a, \hat{f}_b, \hat{f}_c, \hat{f}_d)$ is as
close as possible to $(f_a, f_b, f_c, f_d)$. (Take “as close as possible” to mean
that $\sum_{s \in S} |\hat{f}_s - f_s|$ is minimized.)

Redo Exercise 6.1.1 using the $\hat{f}_s$ as the relative source frequencies. Does
rescaling do much to curb the growth of the denominators of the interval
lengths?

2. S = {a,b,c,d}, $f_a = .4$, $f_b = .3$, $f_c = .2$, and $f_d = .1$. The encoder,
rescaling whenever possible, passes to the decoder the following information,
one line at a time (A stands for the empty string):


Number of source letters processed | New bits added to the code stream by rescaling


Decode on the run, on the heels of the encoding, as best you can. (Note 
that the code string, with N = 7, stands at 101010110, so you can always 
check your work by decoding 1010101101, with N = 7.) 


© 2003 by CRC Press LLC 



6.4 Implementing arithmetic coding 


In this section, some of the practical considerations for implementing arith- 
metic coding are examined. As with all the probabilistic methods presented, 
any model producing symbol probabilities can be used with arithmetic coding. 
The examples in this section use the simplest case of a fixed order-0 model. 

So far, arithmetic coding has been presented with the understanding that 
the encoded stream is determined after the entire input stream is examined, al- 
though rescaling may produce a good part of the code as the source is processed. 
In practice, it is generally not feasible to maintain the precision required to com- 
pute the interval corresponding to the entire source stream, even with rescal- 
ing. In addition, in many applications transmission must begin before the entire 
stream has been coded. 

Typically, arithmetic will be of limited precision; in fact, an approximate 
version of dfwld arithmetic coding can be implemented entirely with integer 
arithmetic (32 bits of precision is common). On many machines, integer opera- 
tions are much faster than floating-point, and, in addition, portability consider- 
ations are simpler. 

The scheme of this section will use the rescaling and incremental transmis- 
sion described earlier: as soon as a digit in the binary representation of the final 
interval is determined, send that bit out as part of the encoded stream and then 
expand the interval. In order to prevent the current interval from becoming too 
small, the underflow expansion will be done in the case that the current interval 
is short but includes 1/2 as an interior point. 

The decoder will need a method to determine when all symbols have been 
recovered. If the source length cannot be provided “up front,” then another
method will be needed to terminate decoding. One possibility is to enlarge 
the symbol set S by adding a special end-of-file symbol, denoted by EOF. Of 
course, enlarging S has a price: the new symbol requires code space. In practice, 
this may be quite small; see Section 6.3 and also [34] for some discussion. 

The algorithm that we will describe here for practical arithmetic coding is 
due to Witten, Neal, and Cleary [84]; in their honor, we will call it the WNC 
algorithm, for short. It is best thought of as a simulation of pure dfwld arithmetic 
encoding and decoding. 

The main feature of the WNC algorithm is that the interval [0, 1) will be 
replaced by a finite set of consecutive integers, {0,..., M— 1}, to be denoted 
[0, M). The choice of M is critical, and will be discussed later. In practice, M 
is always a power of 2, but there is no harm in leaving M unspecified, in what 
follows. 

The WNC encoding algorithm follows dfwld encoding exactly, with the
reservations that computations are replaced by “integer arithmetic,” best
explained by example below; the rescaling and underflow expansions are
obligatory, not optional; and the finish of the encoding process is not just “add 1 for
1/2, the dfwld in the final interval”; there is more to it than that.

The parallels and differences between the two processes are given in Table
6.1. In this table, [L, H) = {L,..., H - 1} will be the “current interval” in WNC
encoding, where L and H are integers. As before, the “current interval” in dfwld
encoding will be denoted $[\alpha, \beta)$. The operation “round down” will be denoted
$\lfloor \cdot \rfloor$. Both dfwld and WNC assume an ordered source alphabet $s_1,\dots,s_m$ with
positive relative frequencies $f_1,\dots,f_m$, usually (but not necessarily) in
non-increasing order. In the WNC algorithm, one of the $s_i$, usually $s_m$, is EOF, with
a small putative relative frequency obtained by taxing the relative frequencies
of the real source letters.


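Table 6.1: Encoding comparison between dfwld and WNC methods.


Step | Dfwld | WNC

Starting interval | $[0, 1)$ | $[0, M) = \{0, \ldots, M-1\}$

New current interval endpoints when the next letter is $s_i$ (old endpoints on the right-hand sides) | $\alpha \leftarrow \alpha + (\beta - \alpha)\sum_{j<i} f_j$, $\beta \leftarrow \alpha + (\beta - \alpha)\sum_{j \le i} f_j$ | $L \leftarrow L + \lfloor (H - L)\sum_{j<i} f_j \rfloor$, $H \leftarrow L + \lfloor (H - L)\sum_{j \le i} f_j \rfloor$

Rescaling | When $\beta \le 1/2$: $\alpha \leftarrow 2\alpha$, $\beta \leftarrow 2\beta$. When $1/2 \le \alpha$: $\alpha \leftarrow 2\alpha - 1$, $\beta \leftarrow 2\beta - 1$. | When $H \le M/2$: $L \leftarrow 2L$, $H \leftarrow 2H$. When $M/2 \le L$: $L \leftarrow 2L - M$, $H \leftarrow 2H - M$.

Underflow expansion | When $1/4 \le \alpha < 1/2 < \beta \le 3/4$: $\alpha \leftarrow 2\alpha - 1/2$, $\beta \leftarrow 2\beta - 1/2$. | When $M/4 \le L < M/2 < H \le 3M/4$: $L \leftarrow 2L - \lfloor M/2 \rfloor$, $H \leftarrow 2H - \lfloor M/2 \rfloor$. (Usually M is even, so $\lfloor M/2 \rfloor = M/2$.)

Ending encoding | At the end, carry out rescaling until $\alpha < 1/2 < \beta$, then add 1 to the code stream. | At the end, having read EOF, carry out rescaling and the underflow expansion until neither can be carried out. At this point, either $L < M/4 < M/2 < H$ or $M/4 \le L < M/2 < 3M/4 < H$. In the former case add $01^{k+1}$ to the code stream, where k is the underflow count; in the latter case add $10^{k+1}$ to the code stream.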


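6.4.1 Example (WNC encoding) (a) S = {a, b, EOF}, $f_a = 6/10$, $f_b = 3/10$,
$f_{EOF} = 1/10$; M = 16. We encode abaEOF, using a table similar to those in
Section 6.1.


Next letter or rescale or underflow | L | H | New code | Underflow count

(start) | 0 | 16 | | 0
a | 0 | $0 + \lfloor (.6)16 \rfloor = 9$ | | 0
b | $0 + \lfloor (.6)9 \rfloor = 5$ | $0 + \lfloor (.9)9 \rfloor = 8$ | | 0
$x \to 2x$ | 10 | 16 | 0 | 0
$x \to 2x - 16$ | 4 | 16 | 1 | 0
a | 4 | $4 + \lfloor (.6)12 \rfloor = 11$ | | 0
$x \to 2x - 8$ | 0 | 14 | | 1
EOF | $0 + \lfloor (.9)14 \rfloor = 12$ | 14 | | 1
$x \to 2x - 16$ | 8 | 12 | 10 | 0
$x \to 2x - 16$ | 0 | 8 | 1 | 0
$x \to 2x$ | 0 | 16 | 0 | 0
(end) | | | 01 |


Thus the code for abaEOF is 01101001. The last “01” is added because the
final interval is [0, 16); L = 0 < M/4 = 4 < M/2 = 8 < H = 16. See the last
part of Table 6.1.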


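(b) S = {a, b, c, EOF}, $f_a = .4$, $f_b = .3$, $f_c = .2$, and $f_{EOF} = .1$. This time,
we take M = 32. We encode bacbEOF.


Next letter or rescale or underflow | L | H | New code | Underflow count

(start) | 0 | 32 | | 0
b | $0 + \lfloor (.4)32 \rfloor = 12$ | $0 + \lfloor (.7)32 \rfloor = 22$ | | 0
$x \to 2x - 16$ | 8 | 28 | | 1
a | 8 | $8 + \lfloor (.4)20 \rfloor = 16$ | | 1
$x \to 2x$ | 16 | 32 | 01 | 0
$x \to 2x - 32$ | 0 | 32 | 1 | 0
c | $0 + \lfloor (.7)32 \rfloor = 22$ | $0 + \lfloor (.9)32 \rfloor = 28$ | | 0
$x \to 2x - 32$ | 12 | 24 | 1 | 0
$x \to 2x - 16$ | 8 | 32 | | 1
b | $8 + \lfloor (.4)24 \rfloor = 17$ | $8 + \lfloor (.7)24 \rfloor = 24$ | | 1
$x \to 2x - 32$ | 2 | 16 | 10 | 0
$x \to 2x$ | 4 | 32 | 0 | 0
EOF | $4 + \lfloor (.9)28 \rfloor = 29$ | 32 | | 0
$x \to 2x - 32$ | 26 | 32 | 1 | 0
$x \to 2x - 32$ | 20 | 32 | 1 | 0
$x \to 2x - 32$ | 8 | 32 | 1 | 0
(end) | | | 10 |


The output code is: 011110011110.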


6.4.2 The WNC algorithm for encoding An ordered source alphabet $s_1,\dots,s_m$,
including a special symbol EOF, and corresponding relative frequencies
$f_1,\dots,f_m$ are given. Also, a (large) positive integer M has been chosen. The
following applies to encoding a source word in which EOF occurs once, at the
end. For j = 1,...,m+1, let $F_j = \sum_{i<j} f_i$.

1. A current interval [L, H) is initialized as [0, M) = {0,..., M - 1} and
maintained at each step. Also, an underflow count is initialized at 0 and
maintained to the end of the file.

2. (Underflow condition) If the current interval [L, H) satisfies $M/4 \le L <
M/2 < H \le 3M/4$, replace the current interval by $[2L - \lfloor M/2 \rfloor, 2H - \lfloor M/2 \rfloor)$
and add 1 to the underflow count. [If M is even, the round down signs may be
deleted from the preceding.]

3. (Shift condition) If the current interval [L, H) satisfies $H \le M/2$,
replace the current interval by [2L, 2H) and output $01^k$, k = underflow count, to
the code stream. If $M/2 \le L$, replace the current interval by [2L - M, 2H - M)
and output $10^k$ to the code stream. In either case, reset the underflow count to
0.

4. If none of the conditions in 2 or 3 hold, look at the next source letter
(indicated by a pointer). If it is $s_j$, assign $L \leftarrow L + \lfloor F_j(H - L) \rfloor$, $H \leftarrow L +
\lfloor F_{j+1}(H - L) \rfloor$ (both computed from the old L and H) and move the pointer
forward, unless $s_j$ = EOF.

5. Repeat 2-4 until EOF has been encountered and none of the conditions
in 2 or 3 hold. If, at this point, L < M/4 < M/2 < H, output $01^{k+1}$ to the code
stream, where k is the underflow count. Otherwise, output $10^{k+1}$. The encoding
is now finished.
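
Steps 1 through 5 can be sketched in Python as follows. This is an illustrative sketch, not the implementation of [84]: the function name `wnc_encode` and its interface are hypothetical, and the relative frequencies are passed as integer weights (6, 3, 1 standing for 6/10, 3/10, 1/10) so that every round-down is computed exactly. The source word is assumed to end with the EOF letter.

```python
def wnc_encode(symbols, weights, M):
    """Encode `symbols` (a list ending with the EOF letter) as a string of bits.

    weights: dict mapping each letter, in alphabet order, to an integer weight;
             f_j = weight / total.
    """
    letters = list(weights)
    total = sum(weights.values())
    cum = [0]                                  # cum[j] = total * F_(j+1)
    for s in letters:
        cum.append(cum[-1] + weights[s])

    L, H = 0, M                                # step 1
    underflow = 0
    out = []
    pos = 0
    while True:
        if 4 * L >= M and 2 * L < M < 2 * H and 4 * H <= 3 * M:
            # Step 2: underflow expansion.
            L, H = 2 * L - M // 2, 2 * H - M // 2
            underflow += 1
        elif 2 * H <= M:
            # Step 3: shift, current interval in the lower half.
            L, H = 2 * L, 2 * H
            out.append("0" + "1" * underflow)
            underflow = 0
        elif 2 * L >= M:
            # Step 3: shift, current interval in the upper half.
            L, H = 2 * L - M, 2 * H - M
            out.append("1" + "0" * underflow)
            underflow = 0
        elif pos < len(symbols):
            # Step 4: narrow the interval for the next source letter.
            j = letters.index(symbols[pos])
            width = H - L
            L, H = L + cum[j] * width // total, L + cum[j + 1] * width // total
            pos += 1
        else:
            break                              # EOF read, no condition holds
    # Step 5: final bits.
    if 4 * L < M:
        out.append("0" + "1" * (underflow + 1))
    else:
        out.append("1" + "0" * (underflow + 1))
    return "".join(out)


print(wnc_encode(["a", "b", "a", "EOF"], {"a": 6, "b": 3, "EOF": 1}, 16))
```

The call shown reproduces Example 6.4.1(a); the same function with weights 4, 3, 2, 1 and M = 32 can be used to check part (b).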


Decoding WNC decoding differs significantly from pure dfwld decoding in 
that the decoder does not use the entire code stream to decode, but rather just the 
(current) first $N = \lceil \log_2 M \rceil$ bits of the code stream. These appear in a register
called v (for value), and change as decoding proceeds, as code bits from the 
right are shifted into v and (one of the first two) code bits on the left (start) of 
v are deleted. (We will make the process clear below.) It is this dependence on 
only the first few bits of the code stream, in decoding, together with the use of 
integer arithmetic, which makes WNC encoding and decoding so practical and 
fast. 

The decoder tracks the encoding process. This means that a current interval 
[L, H) is maintained and whenever any of the conditions in 2 and 3 of the 
algorithm in 6.4.2 hold, the appropriate expansion brings about a shift into and 
out of v. It is not necessary to keep an underflow count, but the underflow 
expansion brings about an unusual shift into the v register: 


$[a_1 a_2 a_3 \ldots a_N]\; a_{N+1} a_{N+2} \cdots \;\to\; [a_1 a_3 \ldots a_N a_{N+1}]\; a_{N+2} \cdots$


That is, the second bit in v is deleted and a new code bit is shifted in.
Meanwhile, the rescaling expansions in 3 of 6.4.2 both bring about


$[a_1 a_2 \ldots a_N]\; a_{N+1} a_{N+2} \cdots \;\to\; [a_2 \ldots a_N a_{N+1}]\; a_{N+2} \cdots$


(In all of the above, $a_1 a_2 \ldots a_N a_{N+1} a_{N+2} \ldots$ are the bits of the current code stream
and the square brackets enclose the v register.)

Decoding of source letters occurs when, finally, no condition in 2 or 3 of
6.4.2 holds. In order to decode, the different subintervals $[L_j, H_j)$ of [L, H)
corresponding to the source letters $s_1,\dots,s_m$ have to be computed, and v has to
be viewed as the binary expansion of a non-negative integer; whichever interval
$[L_j, H_j)$ the value v falls in determines which $s_j$ is next decoded, and $[L_j, H_j)$
replaces [L, H) as the current interval. When EOF is decoded, decoding stops.

One very slight inconvenience of the use of integer arithmetic and the round
down operation is that it is not determinable invariably which interval $[L_j, H_j)$
v lies in simply by calculating $\frac{v - L}{H - L}$. We could be off by one interval if we try
to use the method of 6.1. There are ways of avoiding the full burden (see
the discussion at the end of “Implementation and performance issues”, below),
but, in what follows, we will calculate the subintervals $[L_j, H_j)$ corresponding
to the letters $s_j$, j = 1,...,m, whenever the time to decode has arrived.




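6.4.3 Example (decoding) (a) S = {a, b, EOF}, $f_a = 6/10$, $f_b = 3/10$, $f_{EOF}
= 1/10$, M = 16, and the code is 01101001; N = 4.


v | L | H | $[L_a, H_a)$ | $[L_b, H_b)$ | $[L_{EOF}, H_{EOF})$ | Decode or rescale

$(0110)_2 = 6$ | 0 | 16 | [0, 9) | [9, 14) | [14, 16) | a
0110 | 0 | 9 | [0, 5) | [5, 8) | [8, 9) | b
0110 | 5 | 8 | | | | $x \to 2x$
1101 | 10 | 16 | | | | $x \to 2x - 16$
$(1010)_2 = 10$ | 4 | 16 | [4, 11) | [11, 14) | [14, 16) | a
1010 | 4 | 11 | | | | $x \to 2x - 8$
$(1100)_2 = 12$ | 0 | 14 | [0, 8) | [8, 12) | [12, 14) | EOF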

The decoded source message is: abaEOF. 


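(b) S = {a, b, c, EOF}, $f_a = .4$, $f_b = .3$, $f_c = .2$, $f_{EOF} = .1$, M = 32, and
the code is 011110011110. [N = 5.]


v | L | H | $[L_a, H_a)$ | $[L_b, H_b)$ | $[L_c, H_c)$ | $[L_{EOF}, H_{EOF})$ | Decode or rescale

$(01111)_2 = 15$ | 0 | 32 | [0, 12) | [12, 22) | [22, 28) | [28, 32) | b
01111 | 12 | 22 | | | | | $x \to 2x - 16$
$(01110)_2 = 14$ | 8 | 28 | [8, 16) | [16, 22) | [22, 26) | [26, 28) | a
01110 | 8 | 16 | | | | | $x \to 2x$
11100 | 16 | 32 | | | | | $x \to 2x - 32$
$(11001)_2 = 25$ | 0 | 32 | [0, 12) | [12, 22) | [22, 28) | [28, 32) | c
11001 | 22 | 28 | | | | | $x \to 2x - 32$
10011 | 12 | 24 | | | | | $x \to 2x - 16$
$(10111)_2 = 23$ | 8 | 32 | [8, 17) | [17, 24) | [24, 29) | [29, 32) | b
10111 | 17 | 24 | | | | | $x \to 2x - 32$
01111 | 2 | 16 | | | | | $x \to 2x$
$(11110)_2 = 30$ | 4 | 32 | [4, 15) | [15, 23) | [23, 29) | [29, 32) | EOF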

Decoded source message: bacbEOF. We leave the formulation of the WNC 
decoding algorithm as an exercise. 
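
Readers who want to check the two tables above mechanically may find the following sketch useful. It is one possible formulation, not necessarily the one intended by the exercise, and it rests on assumptions read off from the examples: the relative frequencies are given as integer counts, the rescalings are x ↦ 2x when H ≤ M/2, x ↦ 2x − M when L ≥ M/2, and x ↦ 2x − M/2 when M/4 ≤ L and H ≤ 3M/4, and the code stream is padded with zeros if it runs out. The name `wnc_decode` and its parameters are illustrative only.

```python
def wnc_decode(code, counts, M, eof):
    """Decode a bit string with integer (WNC-style) arithmetic decoding.

    code   -- string of '0'/'1' characters
    counts -- list of (symbol, integer count) pairs, in subinterval order
    M      -- register range, assumed to be a power of two, M = 2**N
    eof    -- the symbol that terminates decoding
    """
    n_bits = M.bit_length() - 1              # N
    total = sum(c for _, c in counts)
    cum = [0]                                # cumulative counts C_0, ..., C_m
    for _, c in counts:
        cum.append(cum[-1] + c)

    bits = iter(code)

    def next_bit():
        return int(next(bits, "0"))          # assumption: pad with zeros

    v = 0
    for _ in range(n_bits):                  # load the v register
        v = 2 * v + next_bit()

    L, H = 0, M
    decoded = []
    while True:
        # Rescale until none of the expansion conditions holds.
        while True:
            if H <= M // 2:
                shift = 0                    # x -> 2x
            elif L >= M // 2:
                shift = M                    # x -> 2x - M
            elif L >= M // 4 and H <= 3 * M // 4:
                shift = M // 2               # x -> 2x - M/2 (underflow)
            else:
                break
            L, H = 2 * L - shift, 2 * H - shift
            v = 2 * v - shift + next_bit()

        # Find the subinterval [lo, hi) containing v.
        width = H - L
        for j, (sym, _) in enumerate(counts):
            lo = L + cum[j] * width // total
            hi = L + cum[j + 1] * width // total
            if lo <= v < hi:
                break
        else:
            raise ValueError("v lies in no subinterval; corrupt code stream?")

        decoded.append(sym)
        if sym == eof:
            return decoded
        L, H = lo, hi


print(wnc_decode("01101001", [("a", 6), ("b", 3), ("EOF", 1)], 16, "EOF"))
print(wnc_decode("011110011110",
                 [("a", 4), ("b", 3), ("c", 2), ("EOF", 1)], 32, "EOF"))
```

The two calls correspond to parts (a) and (b) of Example 6.4.3. Integer counts are used in place of decimal frequencies so that the round-down operation is exact.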
Implementation and performance issues


Some issues related to implementation on a machine have been discussed in 
previous sections. In this section, some additional programming considerations 
are examined. The section concludes with a few notes on performance issues. 

The first question regarding WNC encoding/decoding we need to deal with 
is: how large does M have to be, and why? The larger M is, the more accu- 
rate is the simulation of pure dfwld arithmetic coding provided by the WNC 
algorithms (on the grounds that long intervals of consecutive integers are more 
divisible than short ones, and so better simulate the continuum), and thus the 
closer we are to the Holy Grail, encoding losslessly in H(S) bits per source 
letter. On the other hand, letting M be simply enormous is computationally 
impractical. 

There is a more prosaic consideration concerning the size of M, other than
compression performance: during encoding, when the current interval [L, H)
is subdivided into subintervals $[L_j, H_j)$, $j = 1, \ldots, m$, it must never happen
that $L_j = H_j$ for some j. For if $L_j = H_j$ then $[L_j, H_j)$ is empty, and if the
next letter is $s_j$, the encoding process will crash or enter an infinite
rescaling loop.

Given $S = \{s_1, \ldots, s_m\}$ and positive relative frequencies $f_1, \ldots, f_m$, let
$F_j = \sum_{i<j} f_i$, $j = 1, \ldots, m+1$. The disaster we have to avoid is
$L + \lfloor F_j(H - L) \rfloor = L + \lfloor F_{j+1}(H - L) \rfloor$, i.e., $\lfloor F_j(H - L) \rfloor = \lfloor F_{j+1}(H - L) \rfloor$, for some j, when we
have a current interval [L, H) satisfying none of the conditions in 2 or 3 of 6.4.2.
(That is, the time has come to read the next source letter and replace [L, H) by
the subinterval $[L_j, H_j)$ corresponding to that source letter.)

Notice that $F_{m+1} = 1 > F_m = 1 - f_m$, and H − L is a positive integer, so
$\lfloor F_m(H - L) \rfloor < H - L = \lfloor F_{m+1}(H - L) \rfloor$. Therefore, the disaster we fear will
never occur with j = m. Therefore, to avoid calamity it suffices that
$f_j(H - L) \ge 1$ (why?) for $j = 1, \ldots, m-1$.

If none of the conditions in 2 and 3 of 6.4.2 holds, then either L < M/4 <
M/2 < H or L < M/2 < 3M/4 < H. Let us suppose that M is divisible by
4. Then we see that if the current interval [L, H) is about to be subdivided, it
must be that $H - L \ge M/4 + 2$. Putting this together with the condition in the
preceding paragraph we obtain the following.


6.4.4 Given $S = \{s_1, \ldots, s_m\}$ and positive relative frequencies $f_1, \ldots, f_m$, let
$f_{\min} = \min[f_1, \ldots, f_{m-1}]$. The WNC encoding algorithm, applied to source
text over S, will not crash due to an empty interval if the integer M is divisible
by 4 and $M \ge (4/f_{\min}) - 8$.


In the situation of 6.4.1(a), for example, $f_{\min} = .3$ and we could have proceeded
with M = 8. In 6.4.1(b), $f_{\min} = .2$ and we could have proceeded with
M = 12. Of course, the compression achievable with these smaller values of M 
will probably not match that achievable with the larger values, 16 and 32, over 
the long haul, but it might be interesting to experiment and find out roughly how 
much we lose with the smaller values. 
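
As a small illustration of the test in 6.4.4, the sketch below computes the least multiple of 4 satisfying $M \ge (4/f_{\min}) - 8$. The function name `min_register_size` is made up for this example, and exact fractions are assumed so that the rounding is not disturbed by floating-point error; the last letter in the list (EOF in the examples) is excluded from the minimum, as in 6.4.4.

```python
from fractions import Fraction
import math


def min_register_size(freqs):
    """Least positive multiple of 4 with M >= 4/f_min - 8, where f_min is the
    minimum over all relative frequencies except the last one (see 6.4.4)."""
    f_min = min(freqs[:-1])
    bound = Fraction(4) / f_min - 8
    return max(4, 4 * math.ceil(bound / 4))


# Frequencies of 6.4.1(a) and 6.4.1(b), in the order a, b, (c,) EOF.
freqs_a = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]
freqs_b = [Fraction(4, 10), Fraction(3, 10), Fraction(2, 10), Fraction(1, 10)]
print(min_register_size(freqs_a))  # 8
print(min_register_size(freqs_b))  # 12
```

These are the values M = 8 and M = 12 quoted above; the condition is sufficient, not necessary, so smaller values of M may happen to work for particular source texts.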
The test of sufficiency of M given in 6.4.4 becomes important in adaptive
arithmetic coding, in which the relative frequencies vary—see Section 8.3. In 
every case it is good policy to make EOF the last in the order of source letters, 
with the smallest relative frequency. 

As mentioned earlier, the arithmetic is easiest to follow in the case that the
interval length M satisfies $M = 2^N$ for some $N \in \mathbb{N}$, which is a natural choice
for implementing on a machine. The values for the interval (and the value of
v used in decoding) can be maintained in N bits. In addition, a number of the
steps in the scheme can be managed as simple bitwise operations.

To be precise, if the current interval is to be maintained in N bits, then the
register for the right endpoint will contain H − 1 rather than H. It turns out that
this is the natural choice if bitwise operations are to be used. The shift condition
in step 3 of the algorithm 6.4.2 is a simple test on the leftmost (called the most
significant bit or MSB) of the N bits of L and H − 1: if MSB(L) = MSB(H − 1)
then shift these registers left, sending the MSB to the output and giving new
values for L and H − 1. The shift on L doubles the value represented in the
lower N − 1 bits, which is what is desired. However, the shift on the right
endpoint is performed on H − 1 and hence 1 must be added to the result.

The underflow condition in step 2 of the algorithm also corresponds to
simple bitwise operations. Underflow occurs when the two leftmost bits of L
and H − 1 are ‘01’ and ‘10’, respectively. In the expansion, the second bit is
deleted in each of L and H − 1, the last N − 2 bits are shifted one space left, a
0 is the new last bit of the L register, and a 1 is the new last bit of H − 1. (Also,
the underflow count is incremented.)
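
The two register manipulations just described can be sketched as follows. This is only an illustration of the description above: registers are modelled as Python integers holding N-bit values, and the function names are invented for the example.

```python
def needs_shift(lo, hi1, n):
    """Step 3 test: the MSBs of the n-bit registers L and H-1 agree."""
    return (lo >> (n - 1)) == (hi1 >> (n - 1))


def needs_underflow(lo, hi1, n):
    """Step 2 test: L starts with '01' and H-1 starts with '10'."""
    return (lo >> (n - 2)) == 0b01 and (hi1 >> (n - 2)) == 0b10


def shift_expand(lo, hi1, n):
    """Shift both registers left; return (output bit, new L, new H-1).
    The MSB leaves the register, 0 enters L and 1 enters H-1."""
    mask = (1 << n) - 1
    out_bit = lo >> (n - 1)
    return out_bit, (lo << 1) & mask, ((hi1 << 1) | 1) & mask


def underflow_expand(lo, hi1, n):
    """Delete the second bit of each register, shift the last n-2 bits left,
    and bring in 0 (for L) and 1 (for H-1). Returns (new L, new H-1)."""
    msb = 1 << (n - 1)
    low = (1 << (n - 2)) - 1
    new_lo = (lo & msb) | ((lo & low) << 1)
    new_hi1 = (hi1 & msb) | ((hi1 & low) << 1) | 1
    return new_lo, new_hi1


# Registers for [L, H) = [5, 8) with N = 4, i.e. L = 0101 and H-1 = 0111.
bit, lo, hi1 = shift_expand(0b0101, 0b0111, 4)
print(bit, format(lo, "04b"), format(hi1, "04b"))    # 0 1010 1111
# Registers for [4, 11): L = 0100, H-1 = 1010, an underflow case.
lo, hi1 = underflow_expand(0b0100, 0b1010, 4)
print(format(lo, "04b"), format(hi1, "04b"))         # 0000 1101
```

The first call is the expansion x ↦ 2x of [5, 8) to [10, 16); the second is the expansion x ↦ 2(x − M/4) of [4, 11) to [0, 14), with M = 16.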

An example is the best way to understand the process. The encoding of
6.4.1(a) is repeated in Table 6.2. The format of the table has changed somewhat,
in order to better illustrate the bitwise operations. The calculations haven’t
changed, but the current interval is maintained in N-bit registers as L and H − 1,
with the understanding that this corresponds to the interval [L, H). The left
endpoint of the corresponding subinterval for each symbol is listed as $L_a$, $L_b$, and
$L_{EOF}$, respectively. The result, of course, is the same as before. The algorithm
gives ‘01101001’ as the encoded stream.

For illustration, the table lists all of the left endpoints $L_a$, $L_b$, and $L_{EOF}$.
This is more arithmetic than is required and would be expensive if S has many
symbols. Only the endpoints corresponding to the current input symbol are
needed (and can be calculated from number 4 of 6.4.2).

In decoding, only the subinterval corresponding to the value is needed. In
the decoding examples we calculated all the subintervals $[L_j, H_j)$ corresponding
to different letters, when it came time to decode, but there is a way to avoid this
calculation. Suppose that the relative frequencies $f_i$ are rational numbers, say
$f_i = c_i/C$, $c_i$ and C positive integers, $i = 1, \ldots, m$. [We use the letters c and C
here to suggest the word “counts”, in anticipation of adaptive arithmetic coding.
See Chapter 8.] Let $C_0 = 0$ and $C_i = \sum_{j \le i} c_j$, $i = 1, \ldots, m$. Thus $C_m = C$ and
$C_i/C = F_{i+1}$, $i = 0, \ldots, m$, with the $F_j$ as defined previously. Now the amount
Table 6.2: The encoding of Example 6.4.1 (a).

              Current interval
    Symbol    L      H−1     L_a    L_b    L_EOF              Output
    start     0000   1111    0000   1001   1110
    a         0000   1000    0000   0101   1000
    b         0101   0111    expandᵃ  x ↦ 2x                  0
              1010   1111    expandᵇ  x ↦ 2(x − M/2)          1
              0100   1111    0100   1011   1110
    a         0100   1010    expandᶜ  x ↦ 2(x − M/4)          underflow
              0000   1101    0000   1000   1100
    EOF       1100   1101    expandᵇ  x ↦ 2(x − M/2)          10
              1000   1011    expandᵇ  x ↦ 2(x − M/2)          1
              0000   0111    expandᵃ  x ↦ 2x                  0
              0000   1111

ᵃ Using notation from the C programming language, the expansion is L << 1
and (H − 1) << 1 | 1.

ᵇ The expansion is the same as above, but the leftmost bit must be discarded.
ᶜ This can be written as the bitwise operations (L & (M/4 − 1)) << 1 and
((H − 1) ^ (3M/4)) << 1 | 1.


of arithmetic can be minimized by scaling the value v back to the subintervals
$[C_{i-1}, C_i)$ in order to find the current output symbol. To see how this works,
consider the stage in decoding where $L \le v < H$ and we wish to find i so that
$L_i \le v < H_i$ (giving output symbol $s_i$). From the formulas in 6.4.2, this means

$$\left\lfloor \frac{C_{i-1}}{C}(H - L) \right\rfloor \le v - L < \left\lfloor \frac{C_i}{C}(H - L) \right\rfloor$$

and hence

$$C_{i-1}(H - L) < (v - L + 1)C \le C_i(H - L).$$

Every term is an integer, and it follows that

$$C_{i-1} \le w = \left\lfloor \frac{(v - L + 1)C - 1}{H - L} \right\rfloor < C_i. \qquad (6.1)$$

The steps can be reversed, showing that the scaled value w satisfies $C_{i-1} \le w < C_i$
if and only if $L_i \le v < L_{i+1}$. It is w (along with the cumulative counts, in
the case of adaptive coding; see Section 8.3) which can be used by the decoder
to find the current symbol.

As an example of the use of w to find the current symbol, consider the
second step in the decoding example in 6.4.3 (a) where v = 6 and the current
interval is [L, H) = [0, 9). We can take $C_1 = 6$, $C_2 = 9$, and $C = C_3 = 10$. Our
calculation gives

$$w = \left\lfloor \frac{(v - L + 1)C - 1}{H - L} \right\rfloor = \left\lfloor \frac{(6 - 0 + 1)10 - 1}{9 - 0} \right\rfloor = \left\lfloor \frac{69}{9} \right\rfloor = 7.$$

Since $C_1 = 6 \le w = 7 < 9 = C_2$, it follows that the current symbol is $s_2 = b$. At
this stage, the current interval corresponding to ‘b’ must be calculated and then
decoding continues.
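
The symbol search by way of the scaled value w might be coded as in the sketch below. The function name `find_symbol` is hypothetical, and a binary search over the cumulative counts is only one of several reasonable ways to locate i; the text does not prescribe one.

```python
from bisect import bisect_right


def find_symbol(v, L, H, cum):
    """Return (i, w), where w = floor(((v - L + 1) * C - 1) / (H - L)) and i is
    the index with cum[i-1] <= w < cum[i].

    cum -- cumulative counts [C_0, C_1, ..., C_m] with C_0 = 0 and C_m = C
    """
    C = cum[-1]
    w = ((v - L + 1) * C - 1) // (H - L)
    i = bisect_right(cum, w)      # first index whose cumulative count exceeds w
    return i, w


cum = [0, 6, 9, 10]               # counts 6, 3, 1 for a, b, EOF
print(find_symbol(6, 0, 9, cum))  # (2, 7): the symbol is s_2 = b
```

Only one multiplication and one division are needed per decoded symbol, after which the single subinterval for the symbol found is computed.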


Precision 


If, for some reason, we require the M in the WNC algorithm to be small, we
may allow rough and arbitrary approximation of the relative source frequencies.
If we allow the $f_{\min}$ in 6.4.4 to be as large as 1/m = 1/|S|, then, by the analysis
preceding 6.4.4, the rather minimal condition $|S| \le M/4 + 2$ and M divisible
by 4 will guarantee that the algorithm can proceed. This will not leave much
room to accurately reflect probabilities, however, and in practice M may be
much larger than 4|S|. Performance considerations will place an upper bound
on the number of bits which can be required for the calculations. Letting
$f_i = c_i/C$ as above, $i = 1, \ldots, m$, clearly $f_{\min} \ge 1/C$ and therefore the subintervals
$[L_i, H_i)$ will be nonempty if $C \le M/4 + 2$. With some programming care, the
intermediate calculations can be done if $CM \le 2^p$, where p is the number of
bits of precision. Hence, if $M = 2^m$ (m now denoting the number of bits in the
registers rather than |S|), it suffices to require that $C \le 2^c$, where


$$\log_2 |S| \le c \le m - 2 \quad \text{and} \quad c + m \le p. \qquad (6.2)$$


Today, p = 32 is common, and m = 16 and c = 14 may be a natural choice. 
For large symbol sets, conditions (6.2) could be rather unpleasant even with 
larger p; see Moffat, Neal, and Witten [48] for an improved coder which is 
more flexible. 

As an alternate viewpoint (especially in the case that p is small), one could 
consider that m and c are given, and then choose S appropriately. Finally, note 
that the number of underflow bits is not bounded by the algorithm. However, 
since any pending underflow bits are all the same, only a count need be main- 
tained. 


Performance 


For a given model, Huffman coding is “best possible” among probabilistic 
methods which replace source symbols by an integral number of bits. However, 
it is not optimal in the sense of “entropy.” As the simplest example, consider 
an alphabet with two symbols. Regardless of the probabilities, Huffman will 
assign a single bit to each of the symbols, giving no compression. 

On the other hand, arithmetic coding is optimal, and it can do better than 
Huffman. In particular, arithmetic coding on the two-symbol alphabet can yield 
compression, in contrast to Huffman. It is important to note that implementa- 
tions of arithmetic coding (such as that presented here) will be somewhat less 
than optimal, due to the integer arithmetic and other compromises. Since Huff- 
man is nearly optimal in many cases [23], the choice between Huffman and 
arithmetic is not as simple as the theory might suggest, although it appears safe 
to say that, in practice, arithmetic coding usually gives better compression. 
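
The gap on a two-symbol alphabet is easy to quantify. The short sketch below (an illustration only; the function name is arbitrary) compares the entropy of a two-symbol source with the one bit per symbol that any Huffman scheme must spend.

```python
import math


def entropy_bits(probs):
    """Entropy in bits per symbol of a memoryless source with the given
    probabilities; zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


for p in (0.5, 0.9, 0.99):
    h = entropy_bits([p, 1 - p])
    print(f"P = {p}: entropy {h:.3f} bits/symbol, Huffman 1 bit/symbol")
```

With probabilities .9 and .1 the entropy is about 0.47 bits per symbol, so an ideal arithmetic coder would need less than half the space used by Huffman coding of single symbols.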
Implementing arithmetic coding is not much more difficult than Huffman
coding, but execution speed has been a serious concern. In the case of fixed 
probabilities (static or semi-static coding), Huffman will be significantly faster. 
In the adaptive case, however, maintaining the Huffman codes is expensive 
in time and memory. An optimized arithmetic coding implementation can be 
faster, use less memory, and give better compression [48-51]. 

There has been considerable work to reduce the number of multiplications 
required. Using approximate probabilities can permit replacement of multipli- 
cations by simple shift operations. The QM-coder (a binary arithmetic coder) 
used in JPEG image compression schemes [57] is an attempt to maximize per- 
formance with such methods. 

JPEG provides an example of another consideration in choosing a com- 
pression method. As discussed in Chapter 10, lossy JPEG schemes get most of 
their compression using a transform method, and then Huffman or arithmetic 
coding is used on the output. The QM-coder used in the arithmetic mode may 
be covered by patent, according to the Independent JPEG Group (IJG) [25]. The 
free JPEG software implements the Huffman portion of the specification. Arith- 
metic coding may offer some additional compression, but the IJG writes “Since 
arithmetic coding provides only a marginal gain over the unpatented Huffman 
mode, it is unlikely that very many [JPEG] implementations will support it.” 


Exercises 6.4 
1. A stream from S = {a, b, c, EOF, d} is to be encoded using the statistics
$f_a = 3/7$ and $1/7 = f_b = f_c = f_{EOF} = f_d$.


(a) Encode ‘db’, followed by EOF, using (integer) arithmetic coding with
$M = 2^4$ (i.e., the registers for L and H − 1 are 4-bit). Keep the symbols
in the order listed when assigning subintervals.


(b) Given that the encoder and decoder have agreed on the algorithm, what 
information must be passed in order for the decoder to recover the 
source string? 

(c) Decode the result of (a), showing the details. Minimize the arithmetic 
by using 6.1 to find the symbol. (Let C = 7.) 


2. It is sometimes possible to reduce the size of the output file at the last stage 
of the encoding process. Consider encoding b followed by EOF, using the 
arrangements from Exercise 1. 


(a) Show that the algorithm gives ‘1000001’. 

(b) Let’s try to save a few bits at the last stage. Assume that the decoder 
understands that any encoded string ends with an infinite number of 
zeros. In our problem, the final interval includes 0, and if there are 
no underflow bits pending, then two bits could be saved at the final 
stage. In every case (even if there are underflow bits), sending a single 
1 suffices. (Why?) 
Suppose the algorithm is modified so that the encoder never outputs
until it has a maximum number of equal consecutive bits (and the de- 
coder understands that the encoded string ends with all zeros). Show 
that these changes give ‘1’ as the encoded string for this example. Are 
there any “problems” with these changes? 


3. In Exercise 1, show that it is not possible to scale the symbol counts and 
choose c to satisfy 6.2. However, after expansions, the current interval 
[L, H) is always wide enough to assign symbols to distinct subintervals. 


4. Verify that the bitwise operations in the footnotes to Table 6.2
agree with the expansions described in the algorithm.


5. For a given S, the minimal condition $|S| \le M/4 + 2$ guarantees that the
algorithm can proceed. Name two important advantages in choosing M
larger than this lower bound.


6. Langdon and Rissanen [44] describe a method called bit stuffing to han- 
dle the “carry-over problem” (which corresponds to the underflow case 
discussed in this chapter). The current interval length is maintained in a 
fixed-width register, but normalized so that it represents lengths in [1/2, 1)
(i.e., the register is shifted so that the (implied) binary point is followed by
a 1). The code string is shifted by the same amount in order to maintain
alignment. 


Table 6.3 illustrates the process with a binary arithmetic code.⁴ The source
alphabet is S = {0, 1}, and the first portion of the string to be encoded is
‘0100010’. The probabilities depend on the context s, and P(1 | s) denotes
the probability of 1 in context s. The symbol ‘0’ is “more probable” in this
example, and is assigned to the left subinterval at each stage. Following
the notation of [44], the code string at each stage is C = C(s), and the
(normalized) length is contained in A = A(s). It is understood that the
length A(s1) is obtained by truncating the value P(1 | s)A(s), and then
A(s0) = A(s) − A(s1).⁵


The lines containing “shift” indicate the number of left shifts to be per- 
formed on A and C so that the most significant digit of A is 1 (and the 
result is shown in the following line of the table). This allows the working 
part of C to be maintained in a fixed-length register (4 bits in the table), but 
additions can cause carry-over into the digits preceding the binary point. 
In the example, the last 0 input leads to a string of ones in C. If the next 
input symbol is 1, then a carry-over may propagate up this long string of 
1s, converting each to 0, and terminating only when it reaches a 0. 


⁴ See also Langdon’s paper [43] for additional notes and extended examples of a similar
algorithm.

⁵ In this example, the probabilities are of the form $1/2^j$, and so the calculation is particularly
simple (and amounts to shifting A(s) right by j bits). These probabilities may be approximations to
the actual values and result in a multiplication-free scheme. See [44] for a discussion on how these
estimates should be chosen.

Table 6.3: Binary arithmetic coding with bit stuffing.

    string s   input   P(1|s)   C(s)         A(s)    A(s0)   A(s1)
    null       0       1/2^2    .0000        1       .1100   .0100
    0          1       1/2      .0000        .1100   .0110   .0110
    01                          .0110        .0110   shift 1
    01         0       1/2^2    0.1100       .1100   .1001   .0011
    010        0       1/2^2    0.1100       .1001   .0111   .0010
    0100                        0.1100       .0111   shift 1
    0100       0       1/2^3    01.1000      .1110   .1101   .0001
    01000      1       1/2      01.1000      .1101   .0111   .0110
    010001                      01.1111      .0110   shift 1
    010001     0       1/2      011.1110     .1100   .0110   .0110
    0100010                     011.1110     .0110   shift 1, bit stuff
    0100010                     01110.1100   .1100


In order to limit the number of bits that can be affected by carry-over, an 
extra 0 is “stuffed” into C (shown in the last line of the table). This extra 
zero blocks the propagation of a carry: the digits which appear to the left of 
this stuffed bit no longer participate in the arithmetic, and can be sent out. 


(a) Assume that the probability at the last line of the table is P(1 | s) = 
1/2. If the next input symbol is ‘1’, show that the carry propagates 
into the stuffed bit. 

(b) The string s = 01000101 (i.e., the stream in the table along with the 
digit from (a)) is encoded. Recall that any number in [C(s), C(s) + 
A(s)) determines a valid encoded stream. Show that ‘01111011’ is the 
best representative.⁶


The decoder must also manage the stuffed bit. After receiving a predetermined
number of consecutive 1s (three, in this example), the decoder examines
the next bit (the stuffed bit). If the bit is 0, then no carry at the next
stage has propagated into the current digits, and the stuffed bit is ignored.
If the bit is 1, then it is added to the current value of the codeword.


(c) Show the decoding details for the encoded stream ‘01111011’. 
(d) What is the cost of bit stuffing? 


⁶ Langdon and Rissanen do not include the leading 0 (apparently since the decoder can manage
this case); it has been retained here since it simplifies the discussion.

6.5 Notes


A complete implementation of the scheme by Witten, Neal, and Cleary appears 
in their well-known paper [84]. This article is the basis for corresponding ma- 
terial in Text Compression [8]. Appendix C gives addresses where the original 
and optimized versions may be found. A separate implementation of the same 
coding scheme can be found in the book by Nelson and Gailly [53]. Portabil- 
ity has been a goal in this code, but it should be noted that the sources make a 
few optimizations that assume certain widths on data types. Moffat, Neal, and 
Witten [48] provide a number of improvements to the earlier version. 

The QM-coder used in JPEG is a descendent of the IBM Q-coder, devel- 
oped out of work on compressing bilevel images. A description of the Q-coder 
can be found in a series of articles in the IBM Journal of Research and Devel- 
opment, November 1988. Rabbani and Jones [59] present a short section on the 
Q-coder, and [57] contains a discussion of the QM-coder. 

Binary coders (such as the Q-coder) are an important special case in arith- 
metic coding. Alphabets with more than two symbols can be managed by en- 
coding the current bit according to a suitable context, although performance 
may be unsatisfactory [34,49, 50]. Howard and Vitter [34,35] discuss mod- 
eling and coding methods to improve the speed while preserving most of the 
compression. 

Ross Williams’ thesis [82] contains a lengthy survey of text compression. 
The section on arithmetic coding is short, but includes an interesting alternate 
view of the scheme and a few notes on the basic ideas in the Q-coder devel- 
opment. The internet newsgroups comp.compression (established by Williams 
in 1991) and comp.compression.research can be good sources of information, 
although, like many newsgroups, there is a considerable amount of noise to fil- 
ter. Jean-loup Gailly coordinates the FAQ (Frequently Asked Questions) for 
these newsgroups, which is a good source for introductory material, pointers to 
source code, references, and other information (see Appendix C). 
Higher-order Modeling


In understanding probabilistic or statistical coding methods, it is useful to think 
of the coding process or apparatus as divided into two autonomous packages, 
the coder and the model. The model, or statistical processor, passes information 
about the statistical nature of the source text to the coder, which then uses this 
information to encode the source text efficiently. 

Chapters 5 and 6 were about different kinds of coder, and the model was 
very rudimentary; supposedly a statistical study of the source text was conducted
before encoding to estimate the relative source frequencies, $f_1, \ldots, f_m$,
which are supplied to the coder once and for all (for that source text). In this 
chapter and the next we will look at two different kinds of statistical proces- 
sor more complicated than the plain old non-adaptive, zeroth-order model pre- 
sumed heretofore. In Chapter 8, we take up adaptive methods. Here, we study 
higher-order non-adaptive methods. As noted in Chapter 8, the two sorts of 
model can be crossed to produce hybrid processors, higher-order adaptive mod- 
els. 

All of these different models can be used with either the Huffman or the 
arithmetic coder (or with other statistical coders, such as those based on Shan- 
non’s or Fano’s methods). As long as the coder knows the current values of 
$f_1, \ldots, f_m$, the coder knows how to process the source text. In principle, the
coder need not be tailored to fit the statistical processor. In practice, it may 
increase the efficiency if the coder is modified to mix better with the model, 
in the necessary exchange of information between them. We will present the 
higher-order models (and, in the next chapter, the adaptive statistical processor) 
in alliance with the Huffman and arithmetic coders, and speak as though there 
were such things as “kth-order Huffman encoding,” or “adaptive arithmetic en- 
coding,” because it is convenient to do so, and because we think it might be 
easier for somebody learning about these models for the first time to do so in 
connection with the coders, so that they can see how the full package works. 
But we wish to emphasize, for reasons of academic purity, that, as the title of 
this chapter indicates, “higher-order” is a quality of the model, or statistical 
processor, and can be considered separately from any particular coding method. 
Perhaps it is time to reveal what higher-order modeling is all about. 


Suppose $S = \{s_1, \ldots, s_m\}$, a source alphabet, and an integer $k \ge 0$ are given.
In kth-order encoding, we assume that the relative source frequencies
$f(i_1, \ldots, i_{k+1})$ of the words $s_{i_1} \cdots s_{i_{k+1}}$, among all source words of length k + 1, are given.
In the cases k = 0, 1, 2, we write $f(i) = f_i$ (the relative frequency of $s_i$ much
used in the preceding chapters), $f(i, j) = f_{ij}$, the so-called digram frequency
of $s_i s_j$, and $f(i, j, k) = f_{ijk}$, the trigram frequency of $s_i s_j s_k$.

If the relative frequencies $f(i_1, \ldots, i_{k+1})$ are given, then so are all the
relative frequencies $f(i_1, \ldots, i_t)$, $1 \le i_1, \ldots, i_t \le m$, $1 \le t \le k+1$. For instance,
$f(i_1, \ldots, i_k) = \sum_{j=1}^{m} f(i_1, \ldots, i_k, j)$. Notice that, for instance, when k = 1,
$[f_{ij}] = F$, an m × m matrix of non-negative numbers, is an acceptable matrix of
digram frequencies if and only if, for each i, the ith row sum and the ith column
sum of F are equal, and $1 = \sum_i \sum_j f_{ij}$.
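
The acceptability condition for k = 1 can be checked mechanically. In the sketch below the matrix is the digram matrix used in Example 7.1.1, entered in hundredths so that the comparisons are exact; the function names and the choice of integer units are assumptions made for this illustration.

```python
def is_acceptable_digram_matrix(F, total):
    """True if F is square with non-negative entries, each row sum equals
    the corresponding column sum, and all entries add up to `total`."""
    m = len(F)
    if any(len(row) != m for row in F):
        return False
    row_sums = [sum(row) for row in F]
    col_sums = [sum(F[i][j] for i in range(m)) for j in range(m)]
    nonneg = all(x >= 0 for row in F for x in row)
    return nonneg and row_sums == col_sums and sum(row_sums) == total


def single_letter_frequencies(F):
    """The frequencies f_i recovered as row sums of the digram matrix."""
    return [sum(row) for row in F]


# Digram frequencies of Example 7.1.1, in hundredths.
F = [[16, 10, 10, 4],
     [8, 17, 4, 1],
     [14, 1, 1, 4],
     [2, 2, 5, 1]]
print(is_acceptable_digram_matrix(F, 100))   # True
print(single_letter_frequencies(F))          # [40, 30, 20, 10]
```

The row sums give the single-letter frequencies .4, .3, .2, .1, which is also the answer to the "(how?)" in Example 7.1.1.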


7.1 Higher-order Huffman encoding 


One way to use the hard-won knowledge of the relative frequencies
$f(i_1, \ldots, i_{k+1})$ would be to treat $S^{k+1}$ as the source alphabet and to produce an encoding
scheme using Huffman’s algorithm. This encoding scheme would have $m^{k+1}$
lines.

In kth-order Huffman encoding, $k \ge 1$, we have, instead of one big scheme,
rather a lot of little schemes, $m^k$ of them, in fact, each with m lines, so the
total hidden cost of kth-order encoding is about the same as that of zeroth-order
encoding using the huge source alphabet $S^{k+1}$. Let us call each source word
$s_{i_1} \cdots s_{i_k}$ of length k a kth-order context. For each such context, and $1 \le j \le m$,
let

$$P(s_j \mid s_{i_1} \cdots s_{i_k}) = \frac{f(i_1, \ldots, i_k, j)}{f(i_1, \ldots, i_k)},$$

the conditional probability that, if you have just scanned the word $s_{i_1} \cdots s_{i_k}$ in
the source text, the next letter will be $s_j$. The $m^k$ encoding schemes come
about by applying Huffman’s algorithm to $S = \{s_1, \ldots, s_m\}$ equipped with the
conditional relative frequencies $P(s_1 \mid s_{i_1} \cdots s_{i_k}), \ldots, P(s_m \mid s_{i_1} \cdots s_{i_k})$, for each
context $s_{i_1} \cdots s_{i_k}$. Thus there is one scheme per context, which makes $m^k$ of
them, and each is an encoding scheme for S, and so has m lines.

Once you have all these schemes, how do you encode source text? Each
occurrence of the letter $s_j$ is encoded with the code word for $s_j$ in the scheme
associated with the context $s_{i_1} \cdots s_{i_k}$, the k-letter word immediately preceding
that occurrence of $s_j$. Thus different occurrences of $s_j$ may well be encoded
differently. How, then, will the decoder be able to recognize the code for that
occurrence of $s_j$, following $s_{i_1} \cdots s_{i_k}$? Very simple: the decoder has decoded
the code text preceding the code for that occurrence of $s_j$, so the decoder knows
that it is “in context $s_{i_1} \cdots s_{i_k}$”; the decoder proceeds to scan the code text with
reference to the encoding scheme associated with the correct context.

The discerning reader will have detected that there is a problem with those
first k letters in the source text, which are not preceded by a k-letter context.
No problem: decide on some prefix-condition “starter scheme” for S and use
it for those first k letters. (Of course, the decoder will have to be told what the
starter scheme is.) It seems reasonable to use the Huffman scheme based on the
relative frequencies $f_1, \ldots, f_m$ of the source letters, calculable as follows:

$$f_i = \sum_{1 \le i_1, \ldots, i_k \le m} f(i, i_1, \ldots, i_k).$$

7.1.1 Example k = 1, $S = \{s_1, s_2, s_3, s_4\}$, and

$$[f(i,j)] = [f_{ij}] = \begin{bmatrix} .16 & .10 & .10 & .04 \\ .08 & .17 & .04 & .01 \\ .14 & .01 & .01 & .04 \\ .02 & .02 & .05 & .01 \end{bmatrix}.$$

(How are these $f_{ij}$ found? Sampling. But these particular $f_{ij}$ were just made
up.) We find (how?) that the relative source frequencies of $s_1, s_2, s_3, s_4$ are
$f_1 = .4$, $f_2 = .3$, $f_3 = .2$, $f_4 = .1$, so we take the following as starter scheme:


    s_1 → 0,  s_2 → 10,  s_3 → 110,  s_4 → 111.


Now we compute the context schemes. For context $s_i$, we are supposed to assign
to $s_1, \ldots, s_4$ the conditional relative frequencies $f_{i1}/f_i, \ldots, f_{i4}/f_i$; since these
are proportional to $f_{i1}, \ldots, f_{i4}$, we use these to form the Huffman tree. Similarly,
in general, in kth-order encoding, the Huffman tree for context $s_{i_1} \cdots s_{i_k}$ is
formable with the assignment of the $f(i_1, \ldots, i_k, j)$ to the $s_j$; it is not necessary
to compute $P(s_j \mid s_{i_1} \cdots s_{i_k}) = f(i_1, \ldots, i_k, j)/f(i_1, \ldots, i_k)$.


Context $s_1$ (Huffman tree; leaf weights, with merged node weights in parentheses):

    s_1  .16
    s_2  .10
    s_3  .10
    s_4  .04
    merged nodes: (.14), (.24), (.4)

    This gives the same scheme as the starter scheme.


Context $s_2$ (Huffman tree):

    s_1  .08
    s_2  .17
    s_3  .04
    s_4  .01
    merged nodes: (.05), (.13), (.3)

    Scheme
    s_1 → 00
    s_2 → 1
    s_3 → 010
    s_4 → 011


(Of course, a different labeling of the edges will give a different scheme,
but with the same code word lengths.)

Context $s_3$:

    Scheme
    s_1 → 0
    s_2 → 100
    s_3 → 101
    s_4 → 11


Context $s_4$:

    Scheme
    s_1 → 00
    s_2 → 010
    s_3 → 1
    s_4 → 011


So, for instance, with the four context schemes and the starter scheme at our
disposal, the source text $s_2 s_1 s_1 s_2 s_1 s_4 s_3 s_3 s_1$ is encoded 10000100011111010.
Check the encoding, and also check that the decoder can recover the source
string from the code string, if supplied with the starter and the context schemes.
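
The check suggested above can also be carried out by machine. The sketch below hard-codes the starter scheme and the four context schemes of this example as dictionaries (letters are represented by their indices 1 to 4) and encodes and decodes with k = 1; the names `STARTER`, `CONTEXT`, `encode`, and `decode` are invented for the illustration.

```python
STARTER = {1: "0", 2: "10", 3: "110", 4: "111"}
CONTEXT = {
    1: STARTER,                                   # context s_1: same as starter
    2: {1: "00", 2: "1", 3: "010", 4: "011"},     # context s_2
    3: {1: "0", 2: "100", 3: "101", 4: "11"},     # context s_3
    4: {1: "00", 2: "010", 3: "1", 4: "011"},     # context s_4
}


def encode(letters):
    """First-order encoding: each letter is coded with the scheme of the
    letter preceding it; the first letter uses the starter scheme."""
    out, scheme = [], STARTER
    for s in letters:
        out.append(scheme[s])
        scheme = CONTEXT[s]
    return "".join(out)


def decode(bits):
    """Inverse of encode. Each scheme is a prefix code, so the first code
    word matched in the current scheme is the right one."""
    letters, scheme, word = [], STARTER, ""
    for b in bits:
        word += b
        for s, code_word in scheme.items():
            if code_word == word:
                letters.append(s)
                scheme, word = CONTEXT[s], ""
                break
    return letters


text = [2, 1, 1, 2, 1, 4, 3, 3, 1]
code = encode(text)
print(code)                  # 10000100011111010
print(decode(code) == text)  # True
```

The decoder never needs to be told which scheme is in force: the letter it has just produced determines the scheme for the next code word.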


7.1.2 Computing the compression ratio Again, $S = \{s_1, \ldots, s_m\}$ and the
relative frequencies $f(i_1, \ldots, i_{k+1})$ of the words in $S^{k+1}$ are given. For a context
$s_{i_1} \cdots s_{i_k}$ (assuming $k \ge 1$) and $1 \le j \le m$, let $\ell(i_1, \ldots, i_k, j)$ be the length of the
code word for $s_j$ in the encoding scheme for the context. The average length
of a code word replacing a source letter (neglecting the starter scheme, the effect
of which would be negligible with a large source text) is, by elementary
considerations (see Section 1.8)

$$\ell^{(k)} = \sum_{1 \le i_1, \ldots, i_k \le m} f(i_1, \ldots, i_k) \sum_{j=1}^{m} P(s_j \mid s_{i_1} \cdots s_{i_k})\, \ell(i_1, \ldots, i_k, j)$$

$$\phantom{\ell^{(k)}} = \sum_{1 \le i_1, \ldots, i_{k+1} \le m} f(i_1, \ldots, i_{k+1})\, \ell(i_1, \ldots, i_{k+1}). \quad \text{(Verify!)}$$

Thus, for instance, in Example 7.1.1, we have

$$[\ell(i,j)] = [\ell_{ij}] = \begin{bmatrix} 1 & 2 & 3 & 3 \\ 2 & 1 & 3 & 3 \\ 1 & 3 & 3 & 2 \\ 2 & 3 & 1 & 3 \end{bmatrix}$$

and $\ell^{(1)} = \sum_{1 \le i,j \le 4} f_{ij}\ell_{ij} = 1.72$. (Verify!) By comparison, applying Huffman’s
algorithm to S, with $f_1 = .4$, $f_2 = .3$, $f_3 = .2$, $f_4 = .1$, gives $\ell = \ell^{(0)} = 1.9$.
If you apply Huffman’s algorithm to $S^2$, with $s_i s_j$ assigned relative frequency
$f_{ij}$, you will get $\ell = \ell(S^2) = 3.53$. Thus the compression ratio
achievable by this method, 2L/3.53, assuming the $s_j$ themselves are binary
words with average length L, is less than that achieved by first-order encoding,
L/1.72.
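
The three averages quoted for Example 7.1.1 can be verified with a few lines of code. The sketch relies on a standard fact about Huffman's algorithm: the weighted sum of the code word lengths equals the sum of the weights of all merged (internal) nodes. Frequencies are entered in hundredths so that the arithmetic is exact, and the function name `huffman_cost` is an assumption of this illustration.

```python
import heapq


def huffman_cost(weights):
    """Sum over all letters of weight * (Huffman code word length), computed
    as the sum of the weights of the merged nodes."""
    heap = list(weights)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        cost += a + b
        heapq.heappush(heap, a + b)
    return cost


# Digram frequencies of Example 7.1.1, in hundredths.
F = [[16, 10, 10, 4],
     [8, 17, 4, 1],
     [14, 1, 1, 4],
     [2, 2, 5, 1]]

zeroth = huffman_cost([sum(row) for row in F])     # Huffman on S
first = sum(huffman_cost(row) for row in F)        # one scheme per context
pairs = huffman_cost([x for row in F for x in row])  # Huffman on S^2

print(zeroth / 100, first / 100, pairs / 100)      # 1.9 1.72 3.53
```

Per source letter, Huffman encoding of $S^2$ spends 3.53/2 = 1.765 bits, against 1.72 for first-order encoding, in agreement with the comparison above.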


Here is an academic question of practical importance. Given
$f(i_1, \ldots, i_{k+1})$, $1 \le i_1, \ldots, i_{k+1} \le m$, let $\ell^{(k)}$ be as defined above, and let $\ell(S^{k+1})$ be the
average code word length achieved by Huffman’s algorithm applied to $S^{k+1}$ as
the source alphabet equipped with the relative frequencies $f(i_1, \ldots, i_{k+1})$. Is
it always the case that $\ell^{(k)} \le \frac{1}{k+1}\ell(S^{k+1})$? In other words, is the compression
achieved by kth-order Huffman encoding always at least as good as the
compression achieved by zeroth-order encoding, treating $S^{k+1}$ as the source
alphabet? (Notice that when k = 0, these two are the same.) It is somewhat surprising
that the answer is: not always. See Exercise 7.1.4. Notice that the situation in
that exercise is rather extreme. The next question is: under what conditions do
we have $\ell^{(k)} \le \frac{1}{k+1}\ell(S^{k+1})$? It is a large question that probably does not have
a snappy answer given the current state of our knowledge and terminology, but
its obvious practical importance makes it worth looking into.

Here is another question of practical importance: in case $k \ge 1$, is it
necessarily the case that $\ell^{(k)} \le \ell^{(k-1)}$? Or, can increasing the order sometimes give
you worse compression? We suspect that $\ell^{(k)} \le \ell^{(k-1)}$ always holds, but we
have no proof.


Exercises 7.1 


1. Suppose $S = \{s_1, s_2, s_3, s_4\}$, with $s_1 = 000$, $s_2 = 001$, $s_3 = 01$, $s_4 = 1$, and
digram frequencies $f(i, j) = f_{ij}$ given in

$$[f_{ij}] = \begin{bmatrix} .20 & .04 & .06 & .05 \\ .05 & .17 & .08 & .02 \\ .07 & .08 & .06 & .02 \\ .03 & .03 & .03 & .01 \end{bmatrix}.$$

(a) Find the single-letter relative frequencies $f_1, f_2, f_3, f_4$, and the
compression ratio achieved if Huffman’s algorithm is applied to S.


(b) Find the compression ratio achieved if Huffman’s algorithm is applied
to $S^2$ (with the relative frequencies $f_{ij}$ given above, of course).


(c) Give the four context schemes for first-order encoding of this source
and encode the source string $s_2 s_2 s_1 s_3 s_1 s_1 s_1 s_3 s_2 s_3 s_3 s_1 s_4$. (Use the
scheme associated with (a) for the first letter, $s_2$. There are different
correct schemes for the starter and the contexts, so, if doing this
exercise as part of a problem set, clearly label your schemes.)

(d) Find the compression ratio achieved by first-order encoding of this
source alphabet. (This does not mean the compression ratio achieved
in part (c) on that small segment of source text, but in general, on the
average, on very large “typical” blocks of source text.)


2. Suppose someone were to examine the source text of problem 1 and to
discover the single-letter source frequencies, $f_1$, $f_2$, $f_3$, and $f_4$,
but to remain ignorant of the digram frequencies $f_{ij}$. Suppose this person
applies Huffman’s algorithm to $S^2$, assuming the relative frequency of
$s_i s_j$ among all two-letter source words to be $f_i f_j$.


(a) What compression ratio would this person believe they have achieved,
given their assumption about the digram frequencies?


(b) What compression ratio would they actually have achieved? 


3. The lazy but earnest person of problem 2 also tries first-order Huffman
encoding of the source text of problem 1, again assuming that the relative
frequency of $s_i s_j$ is $f_i f_j$.


(a) What compression ratio does the encoder believe has been achieved by 
this method? 


(b) What is the actual compression ratio achieved? 


4. Let $S = \{s_1, s_2, s_3\}$ and suppose the digram frequencies are given by

\[
[f_{ij}] = \begin{bmatrix}
.7 & .05 & .05 \\
.02 & .04 & .04 \\
.08 & .01 & .01
\end{bmatrix}
\]

Recalling the notation of this section, compute $\ell^{(1)}$ and $\ell(S^2)$
for this source alphabet. Observe that $\ell^{(1)} > \ell(S^2)/2$.


7.2 The Shannon bound for higher-order encoding 


Again, suppose that $k \ge 1$ and the “$(k+1)$-gram” frequencies
$f(i_1,\dots,i_{k+1})$ are given. Recall that the k-gram frequencies are then
known:
\[
f(i_1,\dots,i_k) = \sum_{j=1}^{m} f(i_1,\dots,i_k,j) = \sum_{j=1}^{m} f(j,i_1,\dots,i_k).
\]

In the preceding section we looked at kth-order Huffman encoding. Clearly
other kth-order replacement scheme strategies are possible; you need only
supply a prefix-condition scheme for encoding $s_1,\dots,s_m$ for each context
$s_{i_1}\cdots s_{i_k}$. Let, for some such association of schemes to contexts,
$\ell(i_1,\dots,i_k,j)$ be the length of the code word for $s_j$ in the scheme
corresponding to context $s_{i_1}\cdots s_{i_k}$, and
\[
\ell(i_1,\dots,i_k) = \sum_{j=1}^{m} P(s_j \mid s_{i_1}\cdots s_{i_k})\,\ell(i_1,\dots,i_k,j)
= \sum_{j=1}^{m} \frac{f(i_1,\dots,i_k,j)}{f(i_1,\dots,i_k)}\,\ell(i_1,\dots,i_k,j).
\]
Then the average length of a code word replacing a source letter, using whatever
our context schemes are, is
\[
\ell = \sum_{1 \le i_1,\dots,i_k \le m} f(i_1,\dots,i_k)\,\ell(i_1,\dots,i_k)
= \sum_{1 \le i_1,\dots,i_{k+1} \le m} f(i_1,\dots,i_{k+1})\,\ell(i_1,\dots,i_{k+1}),
\]
as in the preceding section, where the method was kth-order Huffman encoding.

Since Huffman’s algorithm gives the minimal $\ell(i_1,\dots,i_k)$ for each
context $s_{i_1}\cdots s_{i_k}$, among prefix-condition schemes associable to
that context, it follows from the preceding that $\ell^{(k)}$, the value of
$\ell$ for kth-order Huffman encoding, will be the smallest kth-order $\ell$
achievable. Therefore, in thinking about bounds on compression achievable with
kth-order replacement schemes, we may as well stick with kth-order Huffman
encoding. Henceforward, $\ell(i_1,\dots,i_k,j)$ and $\ell^{(k)}$ will be as in
Section 7.1.

Let
\[
H(i_1,\dots,i_k) = -\sum_{j=1}^{m} P(s_j \mid s_{i_1}\cdots s_{i_k}) \log_2 P(s_j \mid s_{i_1}\cdots s_{i_k})
= -\sum_{j=1}^{m} \frac{f(i_1,\dots,i_k,j)}{f(i_1,\dots,i_k)} \log_2 \frac{f(i_1,\dots,i_k,j)}{f(i_1,\dots,i_k)},
\]
the “entropy of the source in context $s_{i_1}\cdots s_{i_k}$.” We define the
kth-order entropy of $S = \{s_1,\dots,s_m\}$ to be
\[
H^{(k)}(S) = H^{(k)} = \sum_{1 \le i_1,\dots,i_k \le m} f(i_1,\dots,i_k)\,H(i_1,\dots,i_k).
\]
Plugging the full gory expression for $H(i_1,\dots,i_k)$ into the expression for
$H^{(k)}$, thrashing about and doing what comes naturally with logarithms, one
finds that
\[
H^{(k)}(S) = H(S^{k+1}) - H(S^k),
\]
where
\[
H(S^{k+1}) = -\sum_{1 \le i_1,\dots,i_{k+1} \le m} f(i_1,\dots,i_{k+1}) \log_2 f(i_1,\dots,i_{k+1})
\]
is the plain old zeroth-order entropy of $S^{k+1}$.

We have, for each context $s_{i_1}\cdots s_{i_k}$,
\[
H(i_1,\dots,i_k) \le \ell(i_1,\dots,i_k) < H(i_1,\dots,i_k) + 1,
\]
by the Noiseless Coding Theorem, plus the fact that the average code word
length obtainable by Huffman’s algorithm is the best (smallest) obtainable with
a prefix code.


7.2.1 Theorem The average code word length $\ell^{(k)}(S)$ achieved by kth-order
Huffman encoding applied to a source alphabet $S$ satisfies
$H^{(k)}(S) \le \ell^{(k)}(S) < H^{(k)}(S) + 1$.


Proof: By the preceding remarks,
\[
\begin{aligned}
H^{(k)}(S) &= \sum_{1 \le i_1,\dots,i_k \le m} f(i_1,\dots,i_k)\,H(i_1,\dots,i_k) \\
&\le \sum_{1 \le i_1,\dots,i_k \le m} f(i_1,\dots,i_k)\,\ell(i_1,\dots,i_k) = \ell^{(k)}(S) \\
&< \sum_{1 \le i_1,\dots,i_k \le m} f(i_1,\dots,i_k)\,\bigl(H(i_1,\dots,i_k) + 1\bigr) \\
&= H^{(k)}(S) + \sum_{i_1,\dots,i_k} f(i_1,\dots,i_k) = H^{(k)}(S) + 1. \qquad \Box
\end{aligned}
\]


7.2.2 Corollary If the $s_i$ are binary words with average length $L$, then the
compression ratio $L/\ell^{(k)}(S)$ achieved by kth-order Huffman encoding
applied to $S$ satisfies $L/(H^{(k)}(S) + 1) < L/\ell^{(k)} \le L/H^{(k)}$.
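
The definitions above are easy to evaluate for $k = 1$. The following is a minimal sketch, assuming the digram frequencies of Example 7.1.1 and the first-order Huffman average length 1.72 quoted in Section 7.1; the helper name `entropy` is our own. It computes $H^{(1)}(S)$ directly from the context entropies and compares it with $H(S^2) - H(S)$ and with the bounds of Theorem 7.2.1.

```python
from math import log2

# Digram frequencies f(i, j) of Example 7.1.1 (row i = context letter).
F = [
    [0.16, 0.10, 0.10, 0.04],
    [0.08, 0.17, 0.04, 0.01],
    [0.14, 0.01, 0.01, 0.04],
    [0.02, 0.02, 0.05, 0.01],
]

def entropy(probs):
    """Zeroth-order entropy in bits; terms with p = 0 contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

f1 = [sum(row) for row in F]                    # single-letter frequencies
H0 = entropy(f1)                                # H(S)
H_S2 = entropy([p for row in F for p in row])   # H(S^2)

# H^(1)(S) from the definition: context entropies weighted by f(i).
H1 = sum(fi * entropy([p / fi for p in row]) for fi, row in zip(f1, F))

ell_1 = 1.72  # first-order Huffman average length quoted in Section 7.1

print(round(H0, 3), round(H1, 3), round(H_S2 - H0, 3))
```

Up to rounding error, the last two printed values agree, and `H1 <= ell_1 < H1 + 1` holds, as the theorem requires.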


As mentioned in the last section, we do not know whether or not $\ell^{(k)}$
always decreases as k increases. If this were the case, then increasing the
order would repay your effort with a better compression ratio. However, when
$m = 256$, as is often the case, it is a lot of trouble to increase the order,
and actual case studies with k = 0, 1, 2, 3 show a discouragingly small
improvement in the compression ratio going from k = 1 to k = 2, and a minuscule
improvement obtainable by taking k = 3.

This sort of experimental observation agrees with the behavior predicted by 
the theory developed by Claude Shannon [63,65]. In practice it is impossible to 
let k get very large, much less go to infinity. And, in fact, there is a theoretical 
obstacle to letting k go to infinity: we would have to have an infinitely long 
source text, given our notion of how the relative frequencies f(i1,...,ik+1) 
are obtained. Shannon gets around this difficulty by envisioning “the source” 
as a probabilistic finite state automaton, a system of states; as time pulses on 
discretely, the current state changes (or not) at each pulse, and source letters are 
emitted. What the next state will be and which letter is emitted are both random 
variables depending on the current state—that is, the different possibilities have 
their probabilities, and those probabilities vary with the current state. Thus 
there is a hypothetically endless string of source letters emitted, with statistical 
properties, including the probabilities $f(i_1,\dots,i_{k+1})$, for each k, determined by
the nature of the source automaton. 

Is every source “language” correctly (whatever that means) modeled by 
some such source automaton? This is a far deeper question than we will ever an- 
swer, although we will have a bit more to say about it in Section 7.4. For now, let 
us assume that our source is one of these Shannon automata. Shannon showed
that the kth-order entropies $H^{(k)}$ tend to a limit, let us call it
$H^{(\infty)}$, which Shannon called the entropy of the source. Thus the Shannon
bound $L/H^{(k)}$ on the compression ratio achieved with kth-order Huffman
encoding tends to a limit $L/H^{(\infty)}$ if $H^{(\infty)} > 0$. Consequently,
when $H^{(\infty)} > 0$, the compression ratio $L/\ell^{(k)}$ cannot be
increased without bound by taking k larger and larger. The experimental case
studies mentioned above, with $\ell^{(2)}$ not much smaller than $\ell^{(1)}$
and $\ell^{(3)}$ very close to $\ell^{(2)}$, are very much in accord with the
picture suggested by Shannon’s results and Theorem 7.2.1 of the compression
ratio coming to a screeching halt at some unbreachable limit, as k increases.

This is an instance of difficult mathematics confirming intuition. If we re- 
quire lossless compression, meaning that the original file shall always be com- 
pletely recoverable from its encoded version, then surely there should be some 
natural limits, depending on the nature of the original file, to how much
compression can be realized. However, it is important to realize that the Shannon
bounds on the compression ratio, of the form $L/H^{(k)}$, $k = 0, 1, \dots, \infty$,
apply to the replacement-by-encoding-scheme methods discussed in this chapter. As
we have seen in the last chapter, these bounds can be beaten by other methods
in some cases. So the natural bound to the compression ratio, even given a
Shannon automaton-type source, may not be the Shannon bound $L/H^{(\infty)}$.

One last remark about the Shannon bounds $L/H^{(k)}$: Shannon asserts, but
does not show, that the $H^{(k)}$ are non-increasing in k in case the source is a
probabilistic finite state automaton. Therefore, the Shannon bounds on the
compression ratio, $L/H^{(k)}$, are going in the right direction (up!) as k
increases, even though we do not know about the actual compression ratios,
$L/\ell^{(k)}$, achievable by kth-order Huffman encoding.

We finish this section with an elementary verification of Shannon’s assertion
about the monotonicity of the $H^{(k)}$, without assuming anything about the
nature of the source.


7.2.3 Theorem Suppose that $k \ge 1$ and the $(k+1)$-gram frequencies
$f(i_1,\dots,i_{k+1})$, $1 \le i_1,\dots,i_{k+1} \le m$, for an m-letter source
$S$ are known. Then
\[
H^{(k)}(S) \le H^{(k-1)}(S).
\]

Proof: Observe that $h(x) = -x\log_2 x$ has negative second derivative on
$(0,\infty)$ and so is concave on $[0,\infty)$. (Note: $h(0) = 0$, by
convention.) Therefore, for $\lambda_1,\dots,\lambda_r \ge 0$ with
$\sum_i \lambda_i = 1$, and $x_1,\dots,x_r \ge 0$,
$\sum_i \lambda_i h(x_i) \le h\bigl(\sum_i \lambda_i x_i\bigr)$. Therefore,
\[
\begin{aligned}
H^{(k)}(S) &= \sum_{1 \le i_1,\dots,i_k \le m} f(i_1,\dots,i_k) \sum_{j=1}^{m} h\!\left(\frac{f(i_1,\dots,i_k,j)}{f(i_1,\dots,i_k)}\right) \\
&= \sum_{1 \le i_2,\dots,i_k,j \le m} \; \sum_{i_1=1}^{m} f(i_1,\dots,i_k)\, h\!\left(\frac{f(i_1,\dots,i_k,j)}{f(i_1,\dots,i_k)}\right) \\
&= \sum_{1 \le i_2,\dots,i_k,j \le m} f(i_2,\dots,i_k) \sum_{i_1=1}^{m} \frac{f(i_1,\dots,i_k)}{f(i_2,\dots,i_k)}\, h\!\left(\frac{f(i_1,\dots,i_k,j)}{f(i_1,\dots,i_k)}\right) \\
&\le \sum_{1 \le i_2,\dots,i_k,j \le m} f(i_2,\dots,i_k)\, h\!\left(\sum_{i_1=1}^{m} \frac{f(i_1,\dots,i_k,j)}{f(i_2,\dots,i_k)}\right) \\
&= \sum_{1 \le i_2,\dots,i_k,j \le m} f(i_2,\dots,i_k)\, h\!\left(\frac{f(i_2,\dots,i_k,j)}{f(i_2,\dots,i_k)}\right) = H^{(k-1)}(S).
\end{aligned}
\]
(When $k = 1$, the sums over $i_2,\dots,i_k,j$ are just sums over $j$.) $\Box$


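7.2.4 Corollary With $k$ and $S$ as above,
\[
(k+1)H^{(k)}(S) \le H(S^{k+1}) \le \frac{k+1}{k}H(S^k) \le \frac{k+1}{k-1}H(S^{k-1}) \le \cdots \le (k+1)H(S).
\]

The left-hand inequality, and its proof below, are due to Shannon [63].


Proof:
\[
\begin{aligned}
H(S^{k+1}) &= \bigl(H(S^{k+1}) - H(S^k)\bigr) + \bigl(H(S^k) - H(S^{k-1})\bigr) + \cdots + \bigl(H(S) - 0\bigr) \\
&= H^{(k)}(S) + H^{(k-1)}(S) + \cdots + H^{(0)}(S) \\
&\ge (k+1)H^{(k)}(S),
\end{aligned}
\]
by the theorem above. Therefore, also,
\[
H(S^{k+1}) \ge (k+1)H^{(k)}(S) = (k+1)\bigl(H(S^{k+1}) - H(S^k)\bigr)
\]
implies $H(S^{k+1}) \le \frac{k+1}{k}H(S^k)$. Hence,
\[
H(S^{k+1}) \le \frac{k+1}{k}H(S^k) \le \frac{k+1}{k}\cdot\frac{k}{k-1}H(S^{k-1})
= \frac{k+1}{k-1}H(S^{k-1}) \le \cdots \le (k+1)H(S). \qquad \Box
\]


Exercises 7.2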


1. Compute $H^{(0)}$ and $H^{(1)}$ for the source of Exercise 7.1.1, and the
Shannon bounds $L/H^{(0)}$ and $L/H^{(1)}$.


2. Compute $H^{(0)}$ and $H^{(1)}$ for the source of Exercise 7.1.4.


3. Show that, for $k \ge 1$, $H^{(k)}(S) = H(S^{k+1}) - H(S^k)$, as asserted in
this section.


4. Show that if the $s_i$ occur randomly and independently in the source text,
with relative frequencies $f_1,\dots,f_m$, so that
$f(i_1,\dots,i_k) = \prod_{j=1}^{k} f_{i_j}$, $1 \le i_1,\dots,i_k \le m$, then
$H^{(k)}(S) = H^{(0)}(S) = H(S) = -\sum_{j=1}^{m} f_j \log_2 f_j$ for every $k \ge 0$.

5. Notice that, in the proof of Theorem 7.2.3, $h$ is strictly concave. This
implies that if $\lambda_1,\dots,\lambda_r > 0$, $\sum_i \lambda_i = 1$, and
$x_1,\dots,x_r \ge 0$, then
$\sum_i \lambda_i h(x_i) < h\bigl(\sum_i \lambda_i x_i\bigr)$ unless all the
$x_i$ are equal.


Find a necessary and sufficient condition on the k- and (k + 1)-gram
frequencies for a source $S$ for the equality $H^{(k)}(S) = H^{(k-1)}(S)$ to hold.


6. What does Corollary 7.2.4 say about the Shannon bounds on the compression
ratios for zeroth-order Huffman replacement, kth-order Huffman replacement, and
Huffman replacement using $S^{k+1}$ or $S^k$ as the source alphabet?




7.3 Higher-order arithmetic coding 


For k a positive integer, kth-order encoding of any sort departs from the
“$(k+1)$-gram” relative frequencies, $f(i_1,\dots,i_{k+1})$,
$1 \le i_1,\dots,i_{k+1} \le m$, where $f(i_1,\dots,i_{k+1})$ is the relative
frequency of the source string $s_{i_1}\cdots s_{i_{k+1}}$ among all blocks of
$k+1$ consecutive letters in the source text. Supposing that we know what these
$(k+1)$-gram relative frequencies are, how would we proceed to take advantage of
this knowledge in arithmetic coding?

The main idea is that, for $t \ge k$, the intervals $A(i_1,\dots,i_t,j)$,
$1 \le j \le m$, for the source words $s_{i_1}\cdots s_{i_t}s_j$, are obtained
by subdividing $A(i_1,\dots,i_t)$ into subintervals of lengths proportional to
the numbers $f(i_{t-k+1},\dots,i_t,j)$, $j = 1,\dots,m$. The actual probabilities
associated to the $s_j$ after $s_{i_1}\cdots s_{i_t}$ are
$f(i_{t-k+1},\dots,i_t,j)/f(i_{t-k+1},\dots,i_t)$, where the k-gram frequencies
$f(j_1,\dots,j_k)$ are given by
$f(j_1,\dots,j_k) = \sum_{i=1}^{m} f(i,j_1,\dots,j_k) = \sum_{i=1}^{m} f(j_1,\dots,j_k,i)$.
These probabilities are just constant multiples of the $f(i_{t-k+1},\dots,i_t,j)$
but, unlike the case of higher-order Huffman encoding, in which absolute
probabilities are not important, you will have to use these probabilities in
calculations.

For instance, let us return to Example 7.1.1, in which
$S = \{s_1, s_2, s_3, s_4\}$, and the digram frequencies are given by

\[
[f(i,j)] = [f_{ij}] = \begin{bmatrix}
.16 & .10 & .10 & .04 \\
.08 & .17 & .04 & .01 \\
.14 & .01 & .01 & .04 \\
.02 & .02 & .05 & .01
\end{bmatrix}
\]

The single letter frequencies are $f_1 = .4$, $f_2 = .3$, $f_3 = .2$, and
$f_4 = .1$. We will use these to start first-order encoding, so the first
intervals are $A(1) = [0, .4)$, $A(2) = [.4, .7)$, $A(3) = [.7, .9)$, and
$A(4) = [.9, 1)$. From there on we have context; every letter after the first
has a predecessor. So, for instance, $A(2,1) = A(s_2s_1) = [.4, .48)$; the
length is .08 because the probability of an $s_1$ when we are in first-order
context $s_2$ is $.08/.3$ and this is multiplied by the length .3 of the
interval $A(2)$. Similarly, $A(2,1,3) = A(s_2s_1s_3) = [.452, .472)$. (Because
we are in context $s_1$, you look at the first row of the matrix of digram
frequencies. The left-hand endpoint of $A(2,1,3)$ is
$.4 + \frac{.16 + .10}{.4}(.08) = .452$, and the length is
$(.08)\frac{.10}{.4} = .02$.)
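
The subdivision procedure for $k = 1$ can be sketched in a few lines. This is an illustrative sketch only, assuming the digram frequencies of Example 7.1.1 (in hundredths, with exact rational arithmetic); the function names `subdivide` and `interval` are our own and letters are given by their indices 1 to 4.

```python
from fractions import Fraction as Fr

# Digram frequencies of Example 7.1.1 in hundredths; F[i][j] is f(i+1, j+1).
F = [
    [16, 10, 10, 4],
    [8, 17, 4, 1],
    [14, 1, 1, 4],
    [2, 2, 5, 1],
]
f1 = [sum(row) for row in F]  # single-letter frequencies, in hundredths

def subdivide(left, length, weights, j):
    """Return (left endpoint, length) of the j-th (0-based) subinterval of
    [left, left + length), split in proportion to the integer weights."""
    total = sum(weights)
    new_left = left + length * Fr(sum(weights[:j]), total)
    return new_left, length * Fr(weights[j], total)

def interval(word):
    """Endpoints of A(i_1, ..., i_t) for first-order coding (1-based letters).

    The first letter is placed with the single-letter frequencies; every
    later letter uses the row of F belonging to its predecessor.
    """
    left, length = Fr(0), Fr(1)
    prev = None
    for letter in word:
        weights = f1 if prev is None else F[prev]
        left, length = subdivide(left, length, weights, letter - 1)
        prev = letter - 1
    return left, left + length

lo, hi = interval([2, 1, 3])
print(float(lo), float(hi))
```

For the word $s_2s_1s_3$ this reproduces the interval $[.452, .472)$ found above.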

Notice that it is not necessary or advisable to put the source letters in order
of non-increasing context probability in subdividing successive intervals. For
instance, in the situation above, $A(4,3)$ may as well be the third interval from
the left in $A(4)$, not the first, even though $f_{43}$ is the largest of
$f_{41}$, $f_{42}$, $f_{43}$, and $f_{44}$.

Decoding in the manner of Section 6.1 proceeds as in that section, except
that, after the first k letters have been decoded, the decoder has to keep account
of the changing context probabilities. For example, with $k = 1$ and the digram
relative frequencies as above, given 01111 and $N = 3$ the decoder could proceed
as follows [$r = (.01111)_2 = 15/32$]:


Next letter    $\alpha$    $\ell$    $(r-\alpha)/\ell$
               0           1         15/32
$s_2$          .4          .3        about .23
$s_1$          .4          .08       about .86
$s_3$


Thus, 01111 would be decoded, correctly, as $s_2s_1s_3$. (Check that
$(.01111)_2$ is the dfwld in $[.452, .472) = A(2,1,3)$, found earlier.) Finding
$\alpha$ and $\ell$ on the third line of the table above, after $s_1$ has been
decoded, follows the same procedure as in encoding. It is notable that in the
second line $(r-\alpha)/\ell \approx .23$ is compared, not to 0 and .4, but to 0
and $.08/.3 \approx .27$, to decode $s_1$ on the next line. Similarly, on that
line $(r-\alpha)/\ell \approx .86$ is compared to $.26/.4 = .65$ and to
$.36/.4 = .9$, to decode $s_3$.
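
The decoding steps in the table can be traced mechanically. The sketch below is one possible arrangement, not a prescribed implementation: it assumes the digram frequencies of Example 7.1.1 (in hundredths), uses exact rational arithmetic, and the function name `decode` is our own.

```python
from fractions import Fraction as Fr

# Digram frequencies of Example 7.1.1 in hundredths; F[i][j] is f(i+1, j+1).
F = [
    [16, 10, 10, 4],
    [8, 17, 4, 1],
    [14, 1, 1, 4],
    [2, 2, 5, 1],
]
f1 = [sum(row) for row in F]  # single-letter frequencies, in hundredths

def decode(bits, n):
    """Decode n letters (1-based indices) from the binary string 'bits',
    read as the binary fraction r = (.bits)_2, by first-order decoding."""
    r = Fr(int(bits, 2), 2 ** len(bits))
    alpha, ell = Fr(0), Fr(1)   # left endpoint and length of current interval
    prev, word = None, []
    for _ in range(n):
        weights = f1 if prev is None else F[prev]
        total = sum(weights)
        x = (r - alpha) / ell   # relative position of r in the current interval
        cum = Fr(0)
        for j, w in enumerate(weights):
            p = Fr(w, total)    # context probability of letter j + 1
            if cum <= x < cum + p:
                alpha, ell = alpha + ell * cum, ell * p
                prev = j
                word.append(j + 1)
                break
            cum += p
    return word

print(decode("01111", 3))
```

Each pass through the outer loop corresponds to one line of the table: the ratio $(r-\alpha)/\ell$ is compared with the cumulative context probabilities, and then $\alpha$ and $\ell$ are updated as in encoding.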

The methods of Section 6.4 can be adapted to higher-order situations—as 
in the examples just exhibited, it is a matter of keeping account of the context 
and adjusting the current source letter probabilities according to the context. 

It is a bit of trouble to take; is it worth it? Zeroth-order dfwld arithmetic
coding encodes in close to the zeroth-order source entropy bits per source letter;
will kth-order arithmetic encoding do the job in something like the kth-order
source entropy bits per letter? (See Section 7.2 for the definition of the
kth-order source entropy, $H^{(k)}(S)$, and note Theorem 7.2.3 which says that
$H^{(k)}(S)$ is non-increasing with k.) Happily, the answer is yes.


7.3.1 Theorem The average length of the code representatives of source words
of length $N$ over a source alphabet $S$, if the code words are derived by
kth-order dfwld arithmetic coding, where $k < N$ and the first k letters of each
source word are processed in zeroth-order fashion, is no greater than
$1 + kH(S) + (N-k)H^{(k)}(S)$.


Thus the average number of bits per source letter with kth-order arithmetic
coding applied to source words of length $N$ is no greater than
$\frac{1}{N} + \frac{k}{N}H(S) + \bigl(1 - \frac{k}{N}\bigr)H^{(k)}(S)$, which is
quite close to $H^{(k)}(S)$, for large $N$.

Theorem 7.3.1 is proved by the same sort of thrashing around as in the
proof of Theorem 6.2.1. The main constituents of the proof are the observations
that the dfwld in an interval of length $\ell$ has a binary expansion of no more
than $\log_2(\ell^{-1}) + 1$ bits, and that the length of the interval
$A(s_{i_1}\cdots s_{i_N})$ derived by the kth-order subdivision procedure is
\[
\Bigl(\prod_{j=1}^{k} f_{i_j}\Bigr)
\frac{f(i_1,\dots,i_{k+1})\,f(i_2,\dots,i_{k+2})\cdots f(i_{N-k},\dots,i_N)}
{f(i_1,\dots,i_k)\,f(i_2,\dots,i_{k+1})\cdots f(i_{N-k},\dots,i_{N-1})}.
\]
We omit the details. See Exercise 2 below.


Exercises 7.3 


1. Suppose $S = \{s_1, s_2, s_3, s_4\}$ and the digram frequencies are as in
Example 7.1.1 (and again in this section).


(a) Encode $s_2s_2s_2s_2$, $s_1s_2s_3s_4$, $s_4s_3s_2s_1$, and $s_2s_1s_4s_4$ by
first-order dfwld arithmetic coding.

(b) Decode 11, 01001, 10101, and 0101, assuming that the source word 
lengths are all 4, and that the encoding method was first-order dfwld 
arithmetic coding. 


2. Prove Theorem 7.3.1. [The hard part will be the following: there are
numbers $C(j_1,\dots,j_{k+1})$, $1 \le j_1,\dots,j_{k+1} \le m$, which satisfy
\[
\sum_{1 \le i_1,\dots,i_N \le m} f(i_1,\dots,i_N) \sum_{r=1}^{N-k} g(i_r,\dots,i_{r+k})
= \sum_{1 \le j_1,\dots,j_{k+1} \le m} C(j_1,\dots,j_{k+1})\,g(j_1,\dots,j_{k+1}),
\]
where $f(i_1,\dots,i_N)$ is the relative frequency of $s_{i_1}\cdots s_{i_N}$
among source words of length $N$, and $g$ could be anything, but will be given by
$g(j_1,\dots,j_{k+1}) = \log_2 \frac{f(j_1,\dots,j_k)}{f(j_1,\dots,j_{k+1})}$ in
the proof; the $C$’s are given by
$C(j_1,\dots,j_{k+1}) = (N-k)f(j_1,\dots,j_{k+1})$. Once you see this, the proof
is straightforward.]


7.4 Statistical models, statistics, and the possibly 
unknowable truth 


Statistical parameters such as the probabilities $f(i_1,\dots,i_{k+1})$ are
estimated, in this case by taking sample means, through real statistics collected
from some messy reality and are then used to talk about, or to do calculations
concerning, that reality. It is always the case that these parameters are used in
conjunction with some sort of mental picture of what that messy reality is like.
We dignify this mental picture with the term “statistical model.”

In many cases the statistical model need not be spelled out. For instance, 
consider the parameter “average number of children per household in the United
States.” What’s the model? There are children, there are households, and
each child belongs to a household; we know all about this from our daily 
experience—no need to make a fuss about the picture of the reality to which 
the parameter applies or from which it is estimated. 

Here is an example that shows that sometimes a fuss is in order. There is a 
probability distribution called the Poisson distribution which applies to simple 
statistical models concerning the number of occurrences of some specific event 
during specified time intervals. For instance, the Poisson distribution is used to 
talk about the number of cars passing a certain point on a certain road between, 
say, 2 and 3 PM every Tuesday, or the number of cesium atoms, in a certain hunk 
of cesium, that will decay in a month. You need one statistical parameter to use 
the Poisson distribution: it is the average number of occurrences of the event 
during the time interval. Clearly this parameter can be estimated by observation. 

Let us take the time interval to be the 24 hours from midnight to midnight, 
and the event to be: the sun rises. Observation clearly suggests that the average 
number of occurrences of this event per time interval is one. Plugging this 
into the Poisson distribution, one finds that the probability that the sun will not 
rise tomorrow is 1/e, the probability that it will rise exactly once is 1/e, and,
therefore, the probability that it will rise 2 or more times is 1 - (2/e), about 1/4.
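
The arithmetic behind these three probabilities is short enough to check directly. The following sketch simply evaluates the Poisson probabilities with mean 1; the function name `poisson_pmf` is our own.

```python
from math import exp, factorial

def poisson_pmf(n, mean):
    """P(exactly n occurrences) under a Poisson distribution with the given mean."""
    return exp(-mean) * mean ** n / factorial(n)

mean = 1.0                    # observed average number of sunrises per day
p0 = poisson_pmf(0, mean)     # no sunrise tomorrow: 1/e
p1 = poisson_pmf(1, mean)     # exactly one sunrise: 1/e
p2_plus = 1.0 - p0 - p1       # two or more sunrises: 1 - 2/e

print(round(p0, 4), round(p1, 4), round(p2_plus, 4))
```

The last value is a little over .26, which is the “about 1/4” of the text.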

Have we just been lucky all these millennia? How do we resolve the disparity
between our experience of the sun rising and the probabilities calculated using
the Poisson distribution? The resolution seems clear: shrug and dismiss the
calculations on the grounds that the rising of the sun every day does not fall into 
the class of phenomena that even approximately conform to a statistical model 
for which the Poisson distribution is valid. We know it doesn’t because the 
conclusions we get from the assumption that it does are absurd. We can leave 
it to the philosophers to sort out the a priori reasons why we should never have 
bothered applying the Poisson distribution to the rising of the sun. 

Unfortunately, in dealing with a source text we are not on such familiar 
ground as we are with the rising of the sun. In attempting the replacement 
encoding that is the subject of this chapter, we estimate certain statistical
parameters, the relative frequencies $f(i_1,\dots,i_{k+1})$, and proceed essentially on
the faith that we are taking advantage of a good statistical model of the source 
to achieve compression. The bottom line for us is the compression ratio. 

Whatever the true nature of the source, there is associated to kth-order en- 
coding a particular statistical model of the probabilistic finite state automaton 
type. For k = 0 there is only one state. At each pulse of time a letter is emitted, 
with each $s_i$ having probability $f_i$ of being emitted, and the system stays in that
state. 

For $k \ge 1$ there is one state for each context. When we are in context
$s_{i_1}\cdots s_{i_k}$, we are also in that state. During the next pulse of
time, a letter $s_j$ is emitted with probability
$P(s_j \mid s_{i_1}\cdots s_{i_k})$ and we move to state
$s_{i_2}\cdots s_{i_k}s_j$. For example, the following state diagram, with the
discs representing states and the labels on the arrows being the probabilities,
depicts a statistical model for the situation of Exercise 7.1.4:

[State diagram not reproduced here: three states $s_1$, $s_2$, $s_3$, with each
arrow from $s_i$ to $s_j$ labeled by $P(s_j \mid s_i)$; for example,
$P(s_1 \mid s_1) = 7/8$.]

(We leave it to the reader to ponder whether or not the source diagrammed
just above really will produce source text exhibiting the digram frequencies in
Exercise 7.1.4. Since $f_{ij} = f_i P(s_j \mid s_i)$, and the $P(s_j \mid s_i)$
are given in the diagram, it suffices to verify that the single letter
frequencies in the source text will be $f_1 = .8$, $f_2 = f_3 = .1$.)
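
One consistency check on these frequencies is easy to carry out by machine. The sketch below assumes the digram frequencies of Exercise 7.1.4 (in hundredths); it computes the arrow labels $P(s_j \mid s_i) = f_{ij}/f_i$ and confirms that the frequencies $.8, .1, .1$ are left unchanged by one pulse of time. This is only a necessary condition; it does not by itself settle what a long run of the source will exhibit.

```python
from fractions import Fraction as Fr

# Digram frequencies of Exercise 7.1.4, in hundredths (row = first letter).
F = [
    [70, 5, 5],
    [2, 4, 4],
    [8, 1, 1],
]
row_sums = [sum(row) for row in F]

# Transition probabilities P(s_j | s_i) = f_ij / f_i (the arrow labels).
P = [[Fr(w, row_sums[i]) for w in row] for i, row in enumerate(F)]

# Candidate single-letter frequencies f_1 = .8, f_2 = f_3 = .1.
f = [Fr(8, 10), Fr(1, 10), Fr(1, 10)]

# Frequency of s_j one pulse later: sum over i of f_i * P(s_j | s_i).
after = [sum(f[i] * P[i][j] for i in range(3)) for j in range(3)]

print(P[0][0], after == f)
```

The first printed value is the label 7/8 on the loop at $s_1$, and the second confirms that the candidate frequencies reproduce themselves.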

The practical question is, will kth-order Huffman or arithmetic encoding 
achieve good compression? This is not really the same as asking if the kth- 
order model is a good model of the source. For example, if the zeroth-order 
model is a perfect model of the source and the relative source frequencies are 
approximately equal, then no spectacular compression is possible by the meth- 
ods of this chapter, and probably not by any methods. Good compression is 
achievable when the relative frequencies are decidedly unequal. 

But you might feel that the question of “goodness” of the model is impor- 
tant, because you feel that if the kth-order model is “good,” then, say, kth-order 
Huffman encoding, while it may not give very good compression, and even, as 
we have seen in the last chapter, may be slightly less compressive than arith- 
metic coding, will still give almost as good lossless compression as can be had. 
Perhaps this feeling is valid in the majority of cases; we would need some def- 
initions of “goodness,” and “as good as can be had” to analyze the situation. 
Meanwhile, here is an extreme and disturbing example that shows that good- 
ness of the kth-order model does not always give as good compression as can 
be had. 


7.4.1 Example $S = \{a,b,c,d\}$ and the source text is
$\overline{abcd} = abcdabcd\cdots$. That is, the source text consists of the
word abcd repeated over and over.

If we knew or noticed that the source text is of this nature, then we could
achieve great compression by what is called “run-length” encoding. If the
particular chunk of source text that we want to encode is $(abcd)^N$, we could
leave a note saying “repeat abcd N times.” The amount of space such a message
would occupy would be no greater than constant $+ \log_2 N$ bits, so the “local”
compression ratio achieved, if a, b, c, d are binary words with average length
$L$, would be no less than $4NL/(\text{const} + \log_2 N) \to \infty$ as
$N \to \infty$. (This does not beat the Shannon bound $L/H^{(\infty)}$, however,
because $H^{(\infty)} = 0$ in this case. See Exercise 7.4.3.)

Suppose we do not notice the nature of the source and attempt kth-order
Huffman encoding. With $k = 0$ we have $f_a = f_b = f_c = f_d = 1/4$, and
Huffman’s algorithm replaces each source letter with a 2-bit binary word. Let
$s_1, s_2, s_3, s_4$ be auxiliary names for a, b, c, d. When $k \ge 1$, only four
contexts $s_{i_1}\cdots s_{i_k}$ have positive probability (1/4 in each case) and
for each of these we have $P(s_j \mid s_{i_1}\cdots s_{i_k}) = 1$ for one value
of $j$, and $P(s_j \mid s_{i_1}\cdots s_{i_k}) = 0$ for other values of $j$. You
carry out kth-order encoding by ignoring the contexts with zero probability. The
result is that every source letter gets replaced by one bit. Thus the
compression ratio achieved is $L/1 = L$ for each $k \ge 1$. This is better than
the zeroth-order compression ratio, but a far cry from what is possible. Yet,
the first-order model of this source is correct. See the comment in Exercise 2
below.
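
The entropies involved in this example can be estimated from a finite chunk of the text. The following is a rough sketch under one stated assumption: the chunk is treated as cyclic, so that the k-gram counts are consistent with the (k+1)-gram counts. The function name `kth_order_entropy` is our own.

```python
from collections import Counter
from math import log2

def kth_order_entropy(text, k):
    """Empirical kth-order entropy (bits per letter) of a text, treating it
    as cyclic so that k-gram and (k+1)-gram counts are consistent."""
    n = len(text)
    wrapped = text + text[:k]
    grams = Counter(wrapped[i:i + k + 1] for i in range(n))   # (k+1)-grams
    contexts = Counter(wrapped[i:i + k] for i in range(n))    # k-grams
    h = 0.0
    for gram, count in grams.items():
        cond = count / contexts[gram[:k]]   # P(next letter | context)
        h -= (count / n) * log2(cond)
    return h

text = "abcd" * 50
print([kth_order_entropy(text, k) for k in range(4)])
```

For this text the zeroth-order value is 2 bits per letter and the values for $k \ge 1$ are 0, while kth-order Huffman encoding still spends one bit per letter, as described above.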


Exercises 7.4 


1. Give the state diagram of the first-order model of the source in Example
7.1.1.


2. Give the state diagram of the first-order model of the source in Example 
7.4.1. (Disconcertingly, this first-order model can be regarded as a perfect 
model of the source—it produces exactly the right source string, although 
not necessarily starting with ‘a’. Yet first-order Huffman encoding does 
not achieve compression as good as can be obtained.) 


3. Show that for the source of Example 7.4.1, $H^{(0)}(S) = 2$ and
$H^{(k)}(S) = 0$, for $k \ge 1$. [Thus $H^{(\infty)}(S) = 0$. Perhaps disturbing
examples like Example 7.4.1 are only possible with sources of zero entropy.]


4. Generating text according to some statistical model can be a mildly amus- 
ing experiment, and it also gives some insights into the model itself. The 
basic goal in compression is to remove redundancy, so the output of a good 
model/coder is typically rather random. We want to reverse the process, 
sending random data to a specific model to generate text. As Bell, Cleary, 
and Witten [8] remark, the output of this reverse process is a rough indica- 
tion of how well the model captured information about the original source. 


Single-letter frequencies from a 133,000-character sample of English ap- 
pear in Table 4.1. Text generated according to these probabilities is not 
likely to be mistaken for lucid prose, as the model knows very little about 
the structure of English. Higher-order models may do better, where the 
necessary statistics are gathered from a sample of “typical” text. 


As an example, models were built from Kate Chopin’s The Awakening.¹
An order-1 model produced the following text when given random input:


asthe thetas tol t dinfrer the Yo Do smp thle s slawhee pss,
tepimuneatage le indave tha cars atuxpre ad merong? d ur atinsinth g
teres runs | ie t ther Mrenorend t fff mbendit’sa aldrea ke Shintimal
*Alesunghed thaf y, He,” ongthagn buid co. fouterokiste singr. fod,


¹ The electronic source for the book was obtained from Project Gutenberg, available via
ftp://uiarchive.cso.uiuc.edu/pub/etext. Thanks to Judith Boss and Michael Hart.


Moving to higher-order models will capture more of the structure of the 
source. An order-3 model, given the same random data, produced: 


assione mult-walking, hous the bodes, to site scoverselestillier from the 
for might. The eart bruthould Celeter, ange brouse, of him. They was 
made theight opened the of her tunear bathe mid notion habited. Mrs. 
She fun andled sumed a vel even stremoiself the was the looke hang! 


Choose your favorite software tool and write a program that builds an order- 
k (k > 0) model from a given source, and then uses that model to generate 
characters from random input.²




7.5 Probabilistic finite state source automata 


The first-order state diagrams introduced in the preceding section are special 
cases (actually, slightly degenerate special cases) of diagrams that we will call 
probabilistic finite state source automata, or pfssa’s, for short. These were intro- 
duced by Shannon in “A mathematical theory of communication” [63], although 
he gave them no special name. They are sources; they produce source text. It 
appears that Shannon entertained the belief that human language, produced by 
a single hypothetical human, could be well approximated — perhaps simulated 
would be a better word — by a large pfssa, or even that human language produc- 
tion could be described exactly by a very large pfssa. 

A pfssa is a finite directed graph (i.e., a diagram consisting of nodes and 
arrows or arcs among the nodes, with each arc having a starting node at one 
end and a not necessarily different finishing node at the arrowhead end) in which 
each arc e is furnished with two labels. One label, g(e), is a positive probability, 
and the other label, s(e), is a letter from a fixed source alphabet S. The labeling
must satisfy two requirements: every source letter must appear somewhere, and 
for each node, the sum of the probability labels on the arcs leaving that node is 
one. 


7.5.1 Example The following diagram shows a pfssa, with three nodes, $S_1$,
$S_2$, $S_3$, over a source alphabet $S = \{a,b,c,d\}$.

[Diagram not reproduced.]


² The authors used awk, which offers associative arrays and other magical features. The script
contained roughly 80 lines of awk code, and required approximately 5 megabytes of memory for
the order-3 example.

The nodes of a pfssa are called states, which is why we named the nodes by
indexing the letter S. (It is just our bad luck that the words “source” and “state” 
start with the same letter.) Pfssa’s differ from probabilistic finite state automata, 
pfsa’s, only in the presence of the letter labels on the arcs. (Also, sometimes 
pfsa’s come equipped with a designated “starting state” which may or may not
ever be revisited. In this text, there will be no such starting states.) A pfssa pro- 
duces source text in the following fashion: time is discretized, and an imaginary 
entity, the source gremlin, is moving among the states of the pfssa, making one 
move per pulse of time. When the gremlin makes a move, the gremlin chooses 
an arc leaving the state in which it currently resides probabilistically, with the 
different arcs having the chance of being chosen indicated by their probability 
labels. Whichever arc is chosen, the letter label of that arc is emitted; it is the 
next letter of the source text. Notice that the source text produced is hypothet- 
ically a two-way infinite string, with no beginning and no end. Because the 
text is produced probabilistically, there are typically many (a very great many!) 
different possible two-way infinite strings that could be produced by a given 
pfssa. When we are reading a particular piece of source text, what we have on
our hands is a finite substring of one of the possible two-way infinite strings of 
source letters producible by the pfssa. 
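
The gremlin’s walk is simple to simulate. The sketch below uses a hypothetical three-state pfssa over {a,b,c,d} that we made up for illustration; it is not the pfssa of Example 7.5.1, and the names `ARCS` and `emit` are our own. Since a program must start somewhere, it produces only a finite substring beginning at an arbitrarily chosen state, rather than a two-way infinite string.

```python
import random

# Hypothetical pfssa, for illustration only (not the one in Example 7.5.1).
# ARCS[state] lists the arcs leaving that state as
# (probability label, letter label, destination state).
ARCS = {
    "S1": [(0.5, "a", "S1"), (0.25, "b", "S2"), (0.25, "c", "S3")],
    "S2": [(0.5, "d", "S1"), (0.5, "a", "S2")],
    "S3": [(1.0, "b", "S2")],
}

def emit(arcs, start, n, rng):
    """Move the gremlin n times from 'start'; return the letters emitted."""
    state, letters = start, []
    for _ in range(n):
        out = arcs[state]
        # Choose one outgoing arc according to the probability labels.
        prob, letter, state = rng.choices(out, weights=[a[0] for a in out])[0]
        letters.append(letter)
    return "".join(letters)

rng = random.Random(0)   # fixed seed so that a run is repeatable
print(emit(ARCS, "S1", 40, rng))
```

In this made-up automaton the only arc leaving S3 emits b, so every c in the output that is not the last letter is followed by b; structural constraints of this kind are what distinguish a pfssa from a source with independent letters.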

Note that the first-order diagrams in Section 7.4 have no letter labels on the 
arcs. However, they can be considered pfssa’s because the states are identified 
with the letters of a source text; in this identification, each letter stands for a 
first-order context, but there is no harm in using the same letter as indicating the 
“next letter”, as well. That is, each arc is considered to be labeled with the letter
with which its destination node is identified. 

The mathematics of pfsa’s had been pretty well worked out by the late 
1940’s, when Shannon created information theory, and he made good use of 
that mathematics in working out the essentials of pfssa’s. What follows is a 
brief account of some of that mathematics, and some of those essentials. 

For two states, S, S’ in a pfsa D, the transition probability from S to S’, 
denoted q(S, S’), is the sum of the probability labels on the arcs going from S 
to S’. Thus q(S, S’) is interpretable as the probability that the pfsa gremlin will 
next be in S’, if currently residing in S. 

If the states of D are ordered, $S_1, \dots, S_n$, then we will abbreviate $q(S_i, S_j) = q_{ij}$. 
Thus, for the pfssa given in 7.5.1, with states $S_1, S_2, S_3$, the matrix $Q = [q_{ij}]$ 
of transition probabilities is 

$$Q = \begin{pmatrix} 0 & 3/4 & 1/4 \\ 2/3 & 1/3 & 0 \\ 0 & 1 & 0 \end{pmatrix}.$$

Now, let us consider the proposition that there are state probabilities P(S), 
the probability that, at a randomly selected pulse of time, at the end of that pulse 
the pfsa gremlin will be residing in state S. The problem with this “definition,” 
aside from the fact that it presumes existence, is in the phrase “a randomly 
selected pulse of time.” There is no probability assignment to the integers which 
gives each integer (i.e., each pulse of time) the same probability. 

The standard way of dealing with this difficulty is rather daunting: the set 
of all possible two-sided sequences of states in which the gremlin might be 
residing is made into a probability measure space in such a way that for each 
state S and for any two different sequence places — i.e. pulses of time — the 
measures of the two sets of sequences with S appearing in those places are 
the same; call this common value P(S). (See [66] for an account of how the 
measure is defined, in slightly different circumstances.) 

Here is a more facile approach that bypasses the philosophical problems 
in defining the probabilities P(S) and leads directly to their computation. If it 
were possible to pick a pulse of time at random, then the next pulse would be 
selected with equal randomness. Thus, if the state probabilities exist, they must 
satisfy 


$$P(S') = \sum_{S} P(S)\, q(S, S')$$


for each state S’ of D, with the sum above taken over the set of states of D. This 
leads to the following. 


7.5.2 Definition If a pfsa D has states $S_1, \dots, S_n$ with transition probabilities 
$Q = [q(S_i, S_j)] = [q_{ij}]$, and if the linear system $Q^T p = p$ has a unique solution 
among the probability vectors 

$$p = \begin{pmatrix} p_1 \\ \vdots \\ p_n \end{pmatrix},$$

then (and only then) $p_i = P(S_i)$ will be called the state probability of (or, the 
probability of the state) $S_i$, $i = 1, \dots, n$. 


7.5.3 For example, for the pfssa of Example 7.5.1, regarded as a pfsa, it is 
straightforward to verify that the homogeneous linear system with matrix of 
coefficients 

$$Q^T - I = \begin{pmatrix} -1 & 2/3 & 0 \\ 3/4 & -2/3 & 1 \\ 1/4 & 0 & -1 \end{pmatrix}$$

has a unique solution among the probability vectors, namely 

$$\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix} = \begin{pmatrix} 4/11 \\ 6/11 \\ 1/11 \end{pmatrix}.$$


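On the other hand, for the pfsa 

[Figure: a pfsa with three states $S_1, S_2, S_3$, in which $S_2$ and $S_3$ each carry a single loop of probability 1.] 

the system 

$$Q^T p = \begin{pmatrix} q_{11} & 0 & 0 \\ q_{12} & 1 & 0 \\ q_{13} & 0 & 1 \end{pmatrix} \begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix} = \begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix},$$

in which the first column holds the transition probabilities out of $S_1$ (with $q_{11} < 1$), 
has an infinite number of solutions among the probability vectors: $(0, t, 1-t)^T$ is a 
solution for any $t \in [0, 1]$. 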
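
The verification asked for in the first example of 7.5.3 can be done mechanically. The sketch below solves $Q^T p = p$ exactly with rational arithmetic, replacing one (redundant) equation of the homogeneous system by the normalization $p_1 + \cdots + p_n = 1$. The matrix Q used here is an assumption: it is the transition matrix consistent with the state probabilities 4/11, 6/11, 1/11 and with the frequency computations of this section. The function name `stationary` is ours.

```python
from fractions import Fraction as F

def stationary(Q):
    """Solve Q^T p = p, sum(p) = 1, by exact Gauss-Jordan elimination.

    Assumes the solution is unique; otherwise no pivot is found and
    StopIteration is raised.
    """
    n = len(Q)
    # Rows of the augmented matrix [Q^T - I | 0].
    A = [[F(Q[j][i]) - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n)]
    # The equations are linearly dependent; swap the last for sum(p) = 1.
    A[-1] = [F(1)] * n + [F(1)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# Assumed transition matrix for the pfssa of Example 7.5.1.
Q = [[F(0), F(3, 4), F(1, 4)],
     [F(2, 3), F(1, 3), F(0)],
     [F(0), F(1), F(0)]]
print(stationary(Q))
```

For a pfsa like the second one in 7.5.3, where the solution is not unique, this sketch fails rather than returning one of the infinitely many solutions.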

A directed graph (digraph) is strongly connected if for any ordered pair 
(u,v) of different nodes in the graph there is a directed walk (i.e., a walk along 
arcs in the directions of the arcs) in the digraph from u to v. For example, the 
underlying digraph of the pfssa in Example 7.5.1 is strongly connected, while 
the pfsa in 7.5.3 is not strongly connected: there is no way to walk from $S_2$ 
to $S_3$ in the digraph. The following is the great truth about strongly connected 
pfsa’s upon which much else is based. 


7.5.4 Suppose that D is a strongly connected pfsa with transition probabilities 
$Q = [q_{ij}]$. Then $Q^T p = p$ has a unique solution among the probability vectors. 

See [21] for a proof of this, and a characterization of those D for which the 
conclusion of 7.5.4 holds. (It is not necessary that D be strongly connected.) 


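Henceforward, our pfssa’s will be assumed to be strongly connected, with 
states $S_1, \dots, S_n$, transition probabilities $q_{ij}$, $1 \le i, j \le n$, and state 
probabilities $p_i = P(S_i)$. Let the source alphabet be $S = \{s_1, \dots, s_m\}$. Then for 
$1 \le i_1, \dots, i_k \le m$, $1 \le k$, the relative frequency of the k-gram $s_{i_1} \cdots s_{i_k}$ in the 
source text is given by 

$$f(s_{i_1} \cdots s_{i_k}) = \sum_{j=1}^{n} p_j \sum \prod_{t=1}^{k} q(e_t),$$

in which the inner sum is taken over those sequences $e_1, \dots, e_k$ of arcs on a directed 
walk starting from $S_j$ whose letter labels are $s_{i_1}, \dots, s_{i_k}$, in that order. (Recall 
that q(e) is the probability on the arc e.) Thus, for instance, for the pfssa of 
7.5.1, in view of 7.5.3, using the letters instead of the $s_i$ to indicate the k-gram, 
the single letter frequencies are given by 

$$f(a) = \tfrac{4}{11}\cdot\tfrac14 + \tfrac{6}{11}\cdot\tfrac23 = \tfrac{5}{11}, \quad f(b) = \tfrac{4}{11}\cdot\tfrac12 = \tfrac{2}{11}, \quad f(c) = \tfrac{4}{11}\cdot\tfrac14 + \tfrac{6}{11}\cdot\tfrac13 = \tfrac{3}{11}, \quad f(d) = \tfrac{1}{11}\cdot 1 = \tfrac{1}{11}.$$

Meanwhile, a couple of digram frequencies are 
$f(ac) = \tfrac{4}{11}\cdot\tfrac14\cdot\tfrac13 + \tfrac{6}{11}\cdot\tfrac23\cdot\tfrac14 = \tfrac{4}{33}$ and $f(ad) = 0$. 
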
With the relative k-gram frequencies at our disposal, for all k = 1, 2, ..., 
we have the kth-order entropies $H^{(k)}(D)$ of the text produced by a strongly 
connected pfssa D, as in Section 7.2, and $H^{(\infty)}(D) = \lim_{k \to \infty} H^{(k)}(D)$. One of 
Shannon’s great theorems is that $H^{(\infty)}(D)$ is directly computable if the structure 
of D is known: it is the average, over the states, of the ordinary, zeroth-order 
entropies of the zeroth-order sources you would get by turning each arc into a 
loop, returning to the state it comes from. 
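
The k-gram frequencies on which these entropies rest can be computed directly from the walk formula: weight each starting state by its state probability and sum the products of arc probabilities over all walks that spell the k-gram. The sketch below does this with exact fractions. The arc table and the state probabilities are assumptions, chosen to agree with the values computed in this section for the pfssa of 7.5.1; the function names are ours.

```python
from fractions import Fraction as F

# Hypothetical arc table: state -> list of (probability, letter, next state).
ARCS = {
    1: [(F(1, 4), "a", 2), (F(1, 2), "b", 2), (F(1, 4), "c", 3)],
    2: [(F(2, 3), "a", 1), (F(1, 3), "c", 2)],
    3: [(F(1), "d", 2)],
}
# State probabilities from 7.5.3.
P = {1: F(4, 11), 2: F(6, 11), 3: F(1, 11)}

def walk_prob(arcs, state, word):
    """Probability that the next len(word) letters emitted from `state` spell word."""
    if not word:
        return F(1)
    return sum((q * walk_prob(arcs, dest, word[1:])
                for q, letter, dest in arcs[state] if letter == word[0]), F(0))

def freq(arcs, p, word):
    """Relative frequency of the k-gram `word` in the source text."""
    return sum((p[s] * walk_prob(arcs, s, word) for s in arcs), F(0))

for word in ("a", "b", "c", "d", "ac", "ad", "abc"):
    print(word, freq(ARCS, P, word))
```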


7.5.5 Theorem (Shannon [63]) Suppose that D is a strongly connected pfssa 
with states $S_1, \dots, S_n$, with state probabilities $P(S_i) = p_i$, $i = 1, \dots, n$. Suppose 
that the source alphabet is $S = \{s_1, \dots, s_m\}$, and for each $i \in \{1, \dots, n\}$, 
$j \in \{1, \dots, m\}$, $h_{ij} = P(s_j \mid S_i)$ is the sum of the probability labels on arcs leaving 
$S_i$ with letter label $s_j$. Then 

$$H^{(\infty)}(D) = -\sum_{i=1}^{n} p_i \sum_{j=1}^{m} h_{ij} \log h_{ij}.$$

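For example, if D is the pfssa in 7.5.1, then 

$$H^{(\infty)}(D) = -\frac{4}{11}\left[\frac14 \log\frac14 + \frac12 \log\frac12 + \frac14 \log\frac14\right] - \frac{6}{11}\left[\frac23 \log\frac23 + \frac13 \log\frac13\right] - \frac{1}{11}\cdot 0 = \frac{2}{11} + \frac{6}{11}\log_2 3,$$

the last equality holding when logarithms are taken to base 2. 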
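
The formula of Theorem 7.5.5 is easy to evaluate numerically. In the sketch below, `p` is the vector of state probabilities and row i of `h` lists $P(s_j \mid S_i)$ for the letters a, b, c, d. Both are assumptions chosen to agree with the computations of this section for the pfssa of 7.5.1, and logarithms are taken to base 2, so the result is in bits per letter.

```python
from math import log2

def source_entropy(p, h):
    """H = -sum_i p_i sum_j h_ij log2(h_ij); terms with h_ij = 0 contribute 0."""
    return -sum(p_i * sum(x * log2(x) for x in row if x > 0)
                for p_i, row in zip(p, h))

p = [4 / 11, 6 / 11, 1 / 11]      # state probabilities of S1, S2, S3
h = [
    [1 / 4, 1 / 2, 1 / 4, 0],     # letters a, b, c, d emitted from S1
    [2 / 3, 0, 1 / 3, 0],         # from S2
    [0, 0, 0, 1],                 # from S3 (deterministic, so zero entropy)
]
H = source_entropy(p, h)
print(round(H, 3))  # about 1.046
```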


Simulating a source with a pfssa 


Suppose we have some source D emitting text, with source alphabet $S = \{s_1, \dots, s_m\}$. 
We do not presume anything about the inner workings of the source; it 
need not be a pfssa. We suppose that we have somehow (perhaps by sampling 
the text) obtained the relative (k + 1)-gram frequencies $f(i_1, \dots, i_{k+1}) = 
f(s_{i_1} \cdots s_{i_{k+1}})$, $1 \le i_1, \dots, i_{k+1} \le m$, of blocks of k + 1 consecutive letters in the 
source text, for some $k \ge 0$. 

The kth-order simulant $D^{(k)}$ of the given source is the pfssa whose states 
are (or, are identified with) those kth order contexts $s_{i_1} \cdots s_{i_k}$ which have positive 
relative frequency $f(i_1, \dots, i_k) = \sum_{j=1}^{m} f(i_1, \dots, i_k, j)$. When k = 0 there is 
only one state, and the arcs are loops with (probability, letter) labels $(f_i, s_i)$, 
$i = 1, \dots, m$. When k > 0, for each context $s_{i_1} \cdots s_{i_k}$ with positive relative frequency 
there is an arc to the state of each context $s_{i_2} \cdots s_{i_k} s_j$ such that $f(i_1, \dots, i_k, j) > 0$; 
that arc has probability label $f(i_1, \dots, i_k, j)/f(i_1, \dots, i_k)$ and letter label $s_j$. 
[When k = 1 the notation may be a bit misleading. In this case there is an arc from 
state (context) $s_i$ to state $s_j$, whenever $f(i, j) > 0$, with probability label 
$f(i, j)/f_i$ and letter label $s_j$.] 
As noted previously, in the special case k = 1 it is convenient to let the name of 
the destination state serve as the letter label on the arc. In fact, a convenience 
of the same type applies to any of these simulant pfssa’s, when k > 0; the letter 
label on an arc into a state corresponding to a context will be understood to be 
the last letter of that context. 
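
The construction just described translates almost word for word into code. The sketch below builds the kth-order simulant from a table of (k + 1)-gram frequencies, with contexts represented as strings. The function name `simulant` and the digram table are ours: the digram frequencies are an assumption, obtained by applying the k-gram frequency formula of this section to an arc table consistent with Example 7.5.1.

```python
from collections import defaultdict
from fractions import Fraction as F

def simulant(freq, k):
    """Return {context: [(probability, letter, destination context), ...]}.

    freq maps each (k+1)-letter string to its relative frequency.
    """
    # Frequency of a context is the sum over its possible next letters.
    ctx_freq = defaultdict(F)
    for gram, f in freq.items():
        ctx_freq[gram[:k]] += f
    arcs = defaultdict(list)
    for gram, f in freq.items():
        if f > 0:
            ctx, letter = gram[:k], gram[k]
            # The destination context is the last k letters of the (k+1)-gram.
            arcs[ctx].append((f / ctx_freq[ctx], letter, gram[1:]))
    return dict(arcs)

# Assumed digram frequencies for the pfssa of 7.5.1 (they sum to 1).
DIGRAMS = {
    "aa": F(5, 33), "ab": F(6, 33), "ac": F(4, 33),
    "ba": F(4, 33), "bc": F(2, 33),
    "ca": F(4, 33), "cc": F(2, 33), "cd": F(3, 33),
    "da": F(2, 33), "dc": F(1, 33),
}
first_order = simulant(DIGRAMS, 1)
for state, out in first_order.items():
    print(state, out)
```

Digrams with frequency zero are simply absent from the table, so the corresponding arcs are never created.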
The first-order simulant of the pfssa in 7.5.1 has four states, corresponding 
to the four letters a, b, c, d. Using the convention above about letter labels, this 
simulant is 

[Figure: the first-order simulant of the pfssa in 7.5.1, with states a, b, c, d and probability labels on the arcs.] 


A good deal of calculation went into making this simulant. For example, to 
calculate the probability on the arc from context a to context c (which arc has 
letter label c, it is understood), we calculated 
$$f(ac) = \frac{6}{11}\cdot\frac{2}{3}\cdot\frac{1}{4} + \frac{4}{11}\cdot\frac{1}{4}\cdot\frac{1}{3} = \frac{4}{33}$$

and then $f(ac)/f(a) = (4/33)/(5/11) = 4/15$. 

The second-order simulant for the pfssa of 7.5.1 will have 10 states, not 16, 
because six of the second-order contexts, namely ad, bb, bd, cb, db and dd, 
never occur in the text generated by that pfssa. In that second-order simulant 
there are, for instance, arcs from the state ab to the states ba and bc, and only to 
those. There are arcs from bc to ca and cc, but not to cd, because f (bcd) = 0 
(i.e., bcd does not appear in the source text). The arc from state ab to state bc 
has probability label f(abc)/f (ab) = (2/33)/(2/11) = 1/3 and letter label c. 

We are just using the pfssa of 7.5.1 as an example — there doesn’t seem to 
be any good reason to replace a perfectly good pfssa as in 7.5.1 by a simulant 
that is more complicated than the original. Simulants are for sources of which 
the inner workings are unknown, but the statistics of the source text output are 
available. 

Shannon had simulants in mind as pfssa simulations of natural language 
production, using either the alphabet of the language, plus some punctuation 
marks, as a source alphabet, or, more promisingly, a significant subset of the 
set of words of the language. If one took as the source alphabet only the 1,000 
most commonly used English words, and compiled statistics by running through 
a huge assortment of modern texts, the third-order simulant obtainable might 
have as many as $(1000)^3$ = one billion states. Higher order simulants will be 
impossibly huge. So it appears that while simulants might be of interest in 
the theory of pfssa’s and their uses in simulating sources, they are not really 
practical for playing around with artificial language production. 

Simulants are too complicated for natural language simulation in one way, 
and not complicated enough in another. We think that a rich area of exploration 
would be opened by letting time play a more dynamic role in the operation of 
pfssa’s: let the probability labels on the arcs be allowed to vary from pulse to 
pulse. So far as we know, no work has been done on this obvious idea—and we 
shall do none here. We return to the subject of simulation by pfssa’s and finish 
this section by giving some facts and raising some questions. 

Suppose that $D^{(k)}$ is the kth-order simulant of a source D with source 
alphabet $S = \{s_1, \dots, s_m\}$. Proofs of 7.5.7-7.5.9 can be found in [21]; a proof of 
7.5.6 will appear soon. 


7.5.6 If D is a “true source” then $D^{(k)}$ is strongly connected. More precisely, 
$D^{(k)}$ is strongly connected if the (k + 1)-gram frequencies used to construct it 
satisfy 

(i) $\sum_{j=1}^{m} f(i_1, \dots, i_k, j) = \sum_{j=1}^{m} f(j, i_1, \dots, i_k)$ for each $i_1, \dots, i_k \in \{1, \dots, m\}$; and 

(ii) for no proper subset $S'$ of S is it the case that $f(i_1, \dots, i_{k+1}) > 0$ only for 
$(s_{i_1}, \dots, s_{i_{k+1}}) \in (S')^{k+1} \cup (S \setminus S')^{k+1}$. 


7.5.7 If $D^{(k)}$ is strongly connected, the state probabilities for $D^{(k)}$ are what you 
would expect: $P(\text{“}s_{i_1} \cdots s_{i_k}\text{”}) = f(i_1, \dots, i_k)$. 


7.5.8 If $D^{(k)}$ is strongly connected, the relative (k + 1)-gram frequencies in the 
source text produced by $D^{(k)}$ are the same as those in the source text produced 
by D. Therefore, $H^{(j)}(D^{(k)}) = H^{(j)}(D)$, $0 \le j \le k$. 


7.5.9 $H^{(\infty)}(D^{(k)}) = H^{(k)}(D^{(k)})$. 


Two sources with the same source alphabet are equivalent if the relative 
k-gram frequencies in the texts produced by the two sources are the same, for 
every k. The great theoretical goal, given a source of unknown structure, is to 
produce a pfssa which is equivalent to the given source. Let us call a source 
which satisfies (i) and (ii) of 7.5.6 for every k > 1 a true source. 


7.5.10 Is every true source equivalent to a strongly connected pfssa? If not, 
how can the source texts produced by strongly connected pfssa’s be distin- 
guished from source texts produced by true sources that are not equivalent to 
pfssa’s? 


7.5.11 If D is a true source with kth-order simulant $D^{(k)}$, and if $H^{(\infty)}(D) = 
H^{(k)}(D)$, are D and $D^{(k)}$ necessarily equivalent? 


Exercises 7.5 


1. Find the state probabilities, single letter, digram, and trigram relative fre- 
quencies, and the zeroth-, first-, and second-order simulants of the follow- 
ing pfssa. 
[Figure: the pfssa D for this exercise, over the alphabet {a, b, c}.] 


Notice that for $k \ge 2$, a k-gram in the text generated by this pfssa must end 
in one of ab, ac, bb, bc, ca. For a non-empty word $u \in \{a, b, c\}^+$, let $u'$ 
denote the word obtained by deleting the first letter of u (on the left). Thus, 
if lgth(u) = 1, then $u' = \lambda$, the empty word. We will use this notation to 
talk about kth-order contexts in the text generated by D, and the kth order 
simulants of D. For instance, for $k \ge 3$, a state uab in $D^{(k)}$ (lgth(u) = k − 2) 
has arcs to the states $u'abb$ and $u'abc$. 


(a) For $k \ge 3$, describe the states in $D^{(k)}$ receiving arcs from the states of 
the form uab, uac, ubb, ubc, and uca. Give the probability and letter 
labels on these arcs. 

(b) By 7.5.8 and 7.5.9, $H^{(k)}(D) = H^{(k)}(D^{(k)}) = H^{(\infty)}(D^{(k)})$. Use this to 
show that the sequence $(H^{(k)}(D))_{k \ge 0}$ is strictly decreasing, and that, 
therefore, D is not equivalent to any of its simulants. 
Adaptive Methods 


The methods to be described in this chapter are, as in Chapters 5 and 6, methods 
for lossless encoding of source text over a fixed source alphabet $S = \{s_1, \dots, s_m\}$. 
The new feature is that no statistical study of the source text need be 
done beforehand. The encoder starts right in encoding the source text, initially 
according to some convention, but with the intention of modifying the encoding 
procedure as more and more source text is scanned. In the cases of adaptive 
Huffman (Section 8.1) and adaptive arithmetic (Section 8.3) encoding, the en- 
coder keeps statistics, in the form of counts of source letters, as the encoding 
proceeds, and modifies the encoding according to those statistics. In interval 
and in recency rank encoding (Section 8.4), encoding of a source letter also 
depends on the statistical nature of the source text preceding the letter, but the 
statistics gathering is not boundless; the encoding rules do not change, but rather 
cleverly take into account the recent statistical history of the source text. 

With the encoding procedure varying according to the statistical nature of 
the source text (unlike higher-order encoding in which the encoding procedure 
varies according to the syntactic nature of the source text), you might wonder, 
how is decoding going to work? No problem, in principle; notice that after 
the decoder has managed to decode an initial segment of the source text, the 
decoder then knows as much about the statistical nature of that segment as did 
the encoder, having come that far in the encoding process. If the decoder knows 
the rules and conventions under which the encoder proceeded, then the decoder 
will know how to decode the next letter. Thus, besides the code stream, the 
decoder will have to be supplied with the details of how the encoder started, and 
of how the encoder will proceed in each situation. The hidden cost suffered here 
is fixed, independent of the length of the source text, and is, therefore, usually 
essentially negligible; indeed, the required understanding between encoder and 
decoder can be “built in” in any implementation of an adaptive method and need 
not be resupplied for each instance of source text. 

A note on classification: the “dictionary” methods to be described in Chap- 
ter 9 are adaptive, but so different from the methods of this chapter that they 
deserve a chapter to themselves. Some would argue that interval and recency 
rank encoding are really dictionary methods; we leave debate on the matter to 
those who enjoy debate. 
8.1 Adaptive Huffman encoding 


In adaptive Huffman encoding, counts of the source letters are kept as scanning 
of the source text proceeds, and the Huffman tree and corresponding encoding 
scheme change accordingly. The counts are proportional to the relative fre- 
quencies of the source letters in the segment of source text processed so far. 
(Or, almost proportional—in many cases it is convenient to start with the source 
letter counts set at some small positive value, like one, which makes the ostensi- 
ble relative frequencies slightly different from the true relative frequencies. The 
difference diminishes as you go deeper into the source text. It is possible to start 
with all counts at zero. The drawback is that the presence of zero node weights 
complicates the tree management technique due to Knuth, to be described in 
the next section.) Thus it makes sense to use the counts as weights on the leaf 
nodes—the nodes associated with the source letters—of the Huffman tree. 

Unfortunately, different Huffman trees can be constructed from the same 
leaf node weights because of choices that sometimes have to be made when two 
or more nodes carry the same weight. Also, different encoding schemes are 
associable with the same Huffman tree, because of different choices that can 
be made in labeling the edges. Therefore, to do adaptive Huffman encoding 
and decoding, we suffer the annoying necessity of adopting explicit conven- 
tions governing how we start and how the choices are to be made in drawing 
the Huffman trees and labeling their edges. We will also need conventions gov- 
erning how to go from one Huffman tree to the next after a count has been 
incremented. 

The conventions of this section, to be described below, are not really prac- 
tical, but will (we think) serve better than “the real thing” as an introduction 
to adaptive Huffman coding. The real thing, i.e., the actual conventions used 
in practice, are a bit trickier to describe; they are well adapted for computer 
implementations, but not so easy to walk through in doing pencil and paper ex- 
ercises. We describe the real conventions, the Knuth-Gallager method of tree 
management, in Section 8.2. 

In this section we shall draw our Huffman trees horizontally, as we did in 
Chapter 5. The weights in the leaf nodes will be non-increasing as you scan 
from top to bottom. Each leaf node establishes a level in the tree; all nodes will 
be on one of these levels. When two nodes are merged, i.e., when they become 
siblings, the parent node will be on the higher of the two levels of the sibling 
nodes. (Thus the root node is guaranteed to be on the level of the highest leaf, 
not that this is any great advantage.) Here’s an example: 
[Figure: a Huffman tree drawn horizontally. The leaves, from top to bottom, are s3, s1, s2, s6, s5, s4; the leaves s3, s1, and s2 each have weight 5. The resulting encoding scheme is:] 

s1 → 10 
s2 → 11 
s3 → 00 
s4 → 0111 
s5 → 0110 
s6 → 010 


This example illustrates two other conventions we will observe: (1) in the 
labeling of the edges (or “branches”) of the tree, the edge going from parent to 
highest sibling is labeled zero, and the lower edge is labeled 1; (2) we “merge 
from below in case of ties.” For instance, in the construction of the tree above, 
when it came to pass that the smallest weight of an unmerged node was 5, there 
were three such nodes to choose among; we chose to merge the two lowest in 
level. 

It remains to specify how to get started, and how the tree will be modified 
when a count is incremented. 

Start: All letters start with a count of 1, and the initial ranking of the source 
letters, from top to bottom, will be $s_1, \dots, s_m$. Thus the initial Huffman tree for 
source text over $S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$ will look like this: 

[Figure: the initial Huffman tree, with leaves s1, ..., s6 from top to bottom, each of weight 1. The resulting encoding scheme is:] 

s1 → 00 
s2 → 01 
s3 → 100 
s4 → 101 
s5 → 110 
s6 → 111 

Update after count incrementation: The assignment of leaf nodes to source 
letters is redone so that the weights are once again in non-increasing order, and 
the source letter whose count has gone up by one is now attached to the lowest 
possible node consistent with this monotonicity; the ranking of the other letters 
is unchanged, but for the possible promotion of that one letter. For instance, in 
the first example, if any of $s_4$, $s_5$, or $s_6$ is the next letter scanned, the weight in the 
corresponding leaf node is increased by one and the letter-to-leaf assignments 
remain the same. In this particular example, incrementing any one of these letter 
counts does not affect the tree structure, and the encoding scheme remains the 
same. If $s_2$ is scanned, the next tree has its top three leaf nodes assigned to $s_2$, $s_3$, 
and $s_1$, in that order, with weights 6, 5, and 5, respectively. If $s_1$ is scanned, the 
order of the top 3 letters is $s_1, s_3, s_2$; if $s_3$ is scanned, the order stays as it is. 

Let’s try some encoding! We will take $S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$ and source 
text starting $s_3s_3s_2s_6s_1\,s_1s_2s_6s_5s_1\,s_3s_3s_5s_6s_1\,s_2s_2s_2$. We leave the full encoding 
of these 18 letters as Exercise 8.1.1, but this is how the code will start: 


The Huffman tree after the first two letters (s3’s) are scanned looks like this: 


[Figure: the Huffman tree after the first two letters; the leaves, from top to bottom, are s3, s1, s2, s4, s5, s6.] 


(Are you surprised that the tree comes out like this? Recall the convention that 
when there is a choice of unmerged nodes with smallest possible weights, merge 
the two at the lowest possible level.) 

Now let’s try decoding. With $S = \{s_1, s_2, s_3, s_4, s_5, s_6\}$, and all the conventions 
mentioned above, suppose the decoder is faced with 


01100011001011111100... 


The decoder knows the starting scheme; scanning left to right, the decoder 
recognizes 01, which is the code word in the starting scheme for $s_2$. Now the 
decoder knows what the next Huffman tree will be: 

[Figure: the Huffman tree after $s_2$ has been scanned; the top leaf is $s_2$, with weight 2.] 

Having recorded $s_2$, the decoder resumes scanning and soon recognizes 10, the 
code word in the current scheme, derived from the tree above, for $s_1$. The 
decoder records $s_1$, makes a new tree, with $s_1$ now having a count of 2, and 
forges on. 
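
The conventions of this section are mechanical enough to be written out as a short program, which can be handy for checking pencil and paper work. The sketch below is one reading of those conventions, not the practical method of Section 8.2: it rebuilds the whole tree after every letter. Letters are written as the strings "s1", ..., "s6", and all function names are ours.

```python
def build_codes(order, counts):
    """Code words for the tree built from leaves listed top to bottom in `order`."""
    # Each unmerged node is (weight, level, subtree); level 0 is the top.
    nodes = [(counts[s], level, s) for level, s in enumerate(order)]
    while len(nodes) > 1:
        # Smallest weights first; among equal weights, lowest level first.
        nodes.sort(key=lambda n: (n[0], -n[1]))
        a, b = nodes[0], nodes[1]
        upper, lower = (a, b) if a[1] < b[1] else (b, a)
        # The parent sits on the higher of the two levels.
        nodes = nodes[2:] + [(a[0] + b[0], upper[1], (upper[2], lower[2]))]
    codes = {}
    def label(tree, prefix):
        if isinstance(tree, tuple):       # internal node: (higher, lower) child
            label(tree[0], prefix + "0")  # edge to the higher sibling is 0
            label(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    label(nodes[0][2], "")
    return codes

def update(order, counts, s):
    """Increment the count of s and promote it just far enough to keep order."""
    counts[s] += 1
    i = order.index(s)
    while i > 0 and counts[order[i - 1]] < counts[s]:
        order[i - 1], order[i] = order[i], order[i - 1]
        i -= 1

def encode(text, alphabet):
    order, counts = list(alphabet), dict.fromkeys(alphabet, 1)
    words = []
    for s in text:
        words.append(build_codes(order, counts)[s])
        update(order, counts, s)
    return words

def decode(bits, alphabet):
    order, counts = list(alphabet), dict.fromkeys(alphabet, 1)
    out, pos = [], 0
    while pos < len(bits):
        inverse = {w: s for s, w in build_codes(order, counts).items()}
        end = pos + 1
        while end <= len(bits) and bits[pos:end] not in inverse:
            end += 1
        if end > len(bits):               # trailing bits form no code word
            break
        out.append(inverse[bits[pos:end]])
        update(order, counts, out[-1])
        pos = end
    return out

A = ["s1", "s2", "s3", "s4", "s5", "s6"]
print(build_codes(A, dict.fromkeys(A, 1)))  # the starting scheme
print(decode("0110", A))                    # 01 -> s2, then 10 -> s1
```

Because the decoder repeats exactly the updates the encoder made, `decode("".join(encode(text, A)), A)` returns `text` for any text over A.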


8.1.1 Compression and readjustment 


Keeping letter counts as the source text is scanned amounts to estimating the 
relative source frequencies by sampling. Therefore, the Huffman tree and as- 
sociated encoding scheme derived from the letter counts are expected to set- 
tle down eventually to the fixed tree and scheme that might have arisen from 
counting the letters in a large sample of source text before encoding. Because 
adaptive Huffman encoding is a bit of trouble, and decoding is slow, you might 
think it is worth the trouble to do a statistical study of the source text first, and 
proceed with a fixed encoding scheme. 

But adaptive Huffman encoding offers an advantage over plain zeroth-order 
Huffman encoding that can be quite important in situations when the nature 
of the source text might change. Let us illustrate by taking an extreme case, 
in which the source text consists of the letter a repeated 10,000 times, then 
the letter b repeated 10,000 times, then the letter c repeated 10,000 times, and 
finally the letter d repeated 10,000 times. A thorough statistical study of the 
source text will reveal that the relative source frequencies are all 1/4. Plain 
zeroth-order Huffman encoding will represent a, b, c, and d by the four binary 
words of length 2. (Of course, first-order Huffman encoding will do very well 
in this example, but we are trying to compare zeroth-order adaptive and non- 
adaptive Huffman encoding.) 

Now, adaptive Huffman encoding will start well, and, whatever conventions 
are in force, will encode almost all of those a’s by single digits. However, the 
b’s will be encoded with 2 bits each, and then the c’s and d’s with 3 bits each, 
squandering the early advantage over static Huffman encoding. Obviously the 
problem is that when the nature of the source text changes, one or more of the 
letters may have built up such hefty counts that it takes a long time for the other 
letters to catch up. To put it another way, the new statistical nature of the source 
text is obscured by the statistics from the earlier part of the source text. 

This defect has a perfectly straightforward remedy, first proposed, we think, 
by Gallager [23]. The trick is to periodically multiply all the counts by some 
fraction, like 1/16, and round down. (If the counts are stored as binary numbers, 
this operation is particularly easy if the fraction is an integral power of 1/2.) The 
beauty of this trick is that if the source text is not in the process of changing its 
statistical nature, no harm is done, because the new counts after multiplication 
and rounding are approximately proportional to the old counts; and if the source 
text is changing, the pile of statistics from earlier text has been reduced from a 
mountain to a molehill, and it will not take so long for the statistical lineaments 
of the current text to emerge in the letter counts. 

Of course, the decoder must know when and by how much the counts are 
to be scaled down. Also, if zero counts are not allowed, then steps might have 
to be taken to deal with occasional rounding down to zero. But this is a mere 
annoying detail. 
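
A minimal sketch of the scaling step follows. It assumes the fraction is a power of 1/2, so that the multiplication is a right shift, and it deals with rounding down to zero by a simple floor of 1. The function name and the floor rule are our assumptions; encoder and decoder would have to agree on them, and on when the scaling happens.

```python
def rescale(counts, shift=4, floor=1):
    """Multiply every count by 1/2**shift, rounding down, but not below `floor`."""
    return {letter: max(c >> shift, floor) for letter, c in counts.items()}

counts = {"a": 10000, "b": 3, "c": 1, "d": 1}
print(rescale(counts))  # a's mountain of 10000 shrinks to 625; the rest stay at 1
```

Passing `floor=0` gives the plain round-down behavior for schemes that allow zero counts.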




8.1.2 Higher-order adaptive Huffman encoding 


For $k \ge 1$, kth-order adaptive Huffman encoding proceeds as you might suppose: 
for each kth-order context $s_{i_1} \cdots s_{i_k}$ the encoder keeps counts of the source 
letters occurring in that context (i.e., the scanning of $s_{i_1} \cdots s_{i_k} s_j$ causes the count 
of $s_j$ in context $s_{i_1} \cdots s_{i_k}$ to increase by 1). Huffman trees are maintained for 
each context. The encoder and decoder have to agree on conventions for tree 
formation and maintenance (updating) and also on how to get started; pretty 
obviously these starting rules will be more extensive than in the zeroth-order 
case. It seems reasonable to agree to encode the first k letters (before any con- 
text is established) according to the zeroth-order adaptive Huffman conventions, 
whatever they may be. 

For example, let us take $S = \{s_1, \dots, s_6\}$ and the source text we looked at 
before, 

$$s_3s_3s_2s_6s_1\,s_1s_2s_6s_5s_1\,s_3s_3s_5s_6s_1\,s_2s_2s_2 \cdots \qquad (*)$$


With tree formation and updating conventions as before, and with all single 
letter counts starting at 1 (as before) for zeroth-order encoding, let us stipulate 
that all counts start at 0 in the context schemes; thus, each Huffman tree, in each 
context, starts off like so: 


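[Figure: the starting Huffman tree for each context, with leaves s1, ..., s6 from top to bottom, each of weight 0. The resulting encoding scheme is:] 

s1 → 0 
s2 → 10 
s3 → 110 
s4 → 1110 
s5 → 11110 
s6 → 11111 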

Note that we are allowing zero counts in this instance. 
If we attempt second-order adaptive Huffman encoding of (*), with the 
various conventions and stipulations, the encoding starts 


The first two code words are just as before, in the zeroth-order case. After that, 
the encoding scheme will be stuck on the scheme above, the starting scheme 
for all contexts, until a context repeats. In (*) the first context to repeat is 
$s_2s_6$. However, since $s_1$ followed $s_2s_6$ the first time $s_2s_6$ occurred, the encoding 
scheme for context $s_2s_6$ will not have changed, and the $s_5$ that follows its 
second occurrence will be encoded 11110. (Verify!) By contrast, by the second 
occurrence of context $s_1s_2$, the tree for that context is as follows: 

[Figure: the Huffman tree for context $s_1s_2$; the top leaf is $s_6$, with weight 1, and the other leaves have weight 0.] 

Thus the $s_2$ following the second occurrence of $s_1s_2$ is encoded 110. 
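
The two code words claimed above can be checked with a short program. The sketch below is one reading of the stipulations of this subsection: the first k letters are encoded with the zeroth-order scheme (counts starting at 1), and every context gets its own ranking and counts (starting at 0), created only when the context first occurs. The tree-building and promotion rules of Section 8.1 are repeated so that the sketch runs on its own; all names are ours.

```python
def build_codes(order, counts):
    """Code words under the Section 8.1 conventions (merge from below in ties)."""
    nodes = [(counts[s], level, s) for level, s in enumerate(order)]
    while len(nodes) > 1:
        nodes.sort(key=lambda n: (n[0], -n[1]))
        a, b = nodes[0], nodes[1]
        upper, lower = (a, b) if a[1] < b[1] else (b, a)
        nodes = nodes[2:] + [(a[0] + b[0], upper[1], (upper[2], lower[2]))]
    codes, stack = {}, [(nodes[0][2], "")]
    while stack:
        tree, prefix = stack.pop()
        if isinstance(tree, tuple):
            stack += [(tree[0], prefix + "0"), (tree[1], prefix + "1")]
        else:
            codes[tree] = prefix
    return codes

def update(order, counts, s):
    """Increment the count of s and promote it just far enough to keep order."""
    counts[s] += 1
    i = order.index(s)
    while i > 0 and counts[order[i - 1]] < counts[s]:
        order[i - 1], order[i] = order[i], order[i - 1]
        i -= 1

def encode_k(text, alphabet, k):
    """kth-order adaptive Huffman encoding; returns the list of code words."""
    zeroth = (list(alphabet), dict.fromkeys(alphabet, 1))
    models = {}                        # context -> (ranking, counts), made lazily
    words = []
    for t, s in enumerate(text):
        if t < k:
            order, counts = zeroth     # no context established yet
        else:
            ctx = tuple(text[t - k:t])
            if ctx not in models:
                models[ctx] = (list(alphabet), dict.fromkeys(alphabet, 0))
            order, counts = models[ctx]
        words.append(build_codes(order, counts)[s])
        update(order, counts, s)
    return words

A = ["s1", "s2", "s3", "s4", "s5", "s6"]
text = ["s" + d for d in "332611265133561222"]   # the source text (*)
words = encode_k(text, A, 2)
print(words[8], words[16])   # the s5 after the second s2s6, the s2 after the second s1s2
```

The dictionary `models` only ever holds contexts that have actually occurred, which matters when the alphabet is large.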

If the source alphabet is quite large, say m = |S| = 256, for instance, then 
maintaining the context trees for k > 0 can be expensive. There are $m^2$ contexts 
of order 2, and each context tree has m leaf nodes, and that is really a lot of stuff 
to maintain and to hunt through. Of course, the same volume of stuff has to be 
kept in static (non-adaptive) higher-order Huffman encoding. One way to lessen 
the burden in static kth-order Huffman encoding is to store the array of (k + 1)- 
gram relative frequencies in some sort of alphabetized order so that the line 
$f(i_1, \dots, i_k, 1), \dots, f(i_1, \dots, i_k, m)$ for context $s_{i_1} \cdots s_{i_k}$ can be quickly found, 
and then the tree and scheme for that context can be quickly constructed (and 
stored, if desired) from those relative frequencies. The savings achieved are not 
great; it does not take much more space to store the static context schemes than 
it does to store the (k + 1)-gram relative frequencies. 

In adaptive higher-order Huffman encoding we do not have the simplicity 
of fixed encoding schemes for each context, and it turns out that something 
similar to the storage procedure suggested above is useful. Letter counts in 
contexts replace the (k + 1)-gram relative frequencies. As we shall see in the 
next section, lists of counts with a few frills and pointers added can function as 
easily updated Huffman trees/schemes. 

There is a further savings possible with adaptive kth-order Huffman (and 
with adaptive kth-order arithmetic) coding that takes advantage of the fact that 
some contexts may occur very infrequently. The trick is not to reserve space 
and maintain counts for a particular context until that context actually occurs. 
For more detail on the implementation of adaptive higher-order encoding, we 
refer the interested reader to Bell, Cleary, and Witten [8]. 
Exercises 8.1 


1. Complete the zeroth-order adaptive Huffman encoding of the source text 
labeled (*), above, under the conventions of this section. 


2. Give the first-order adaptive Huffman encoding of the source text labeled 
(*), under the conventions of this section. (In particular: in each context 
the letter counts start at 0.) 


3. Complete the decoding of 01100011001011111100..., with $S = \{s_1, s_2, s_3, 
s_4, s_5, s_6\}$, under the conventions for zeroth-order adaptive Huffman coding 
in this section. 


4. Decode the fragment of code in problem 3 under the conventions for first-order 
adaptive Huffman coding in this section (with S as above). 


8.2 Maintaining the tree in adaptive Huffman encoding: 
the method of Knuth and Gallager 


The problem with the version of adaptive Huffman encoding presented in Sec- 
tion 8.1 is in the updating of the Huffman tree, after a count is incremented. 
Sometimes the tree does not change at all, and sometimes it changes drastically. 
It can change drastically even if the order of assignment of source letters to leaf 
nodes does not change. There does not seem to be any easy way to see what the 
next tree will look like; you have to construct the full tree at each stage, after 
each source letter, and that is a lot of trouble, especially if m = |S| is large. 

In practice, the Huffman tree at each stage is a sequence of registers or 
locations, each representing a node. The contents of each register will be (a) 
the weight of the node, (b) pointers to the sibling children of that node, or an 
indication of which source letter the node represents, should it be a leaf, (c) 
a pointer to the parent of the node, unless the node is the root node, and (d) 
optionally, an indication, in case the node has sibling children, as to which edge 
to them is labeled 0, and which is labeled 1. (We will see later why this feature 
could be optional.) 

A tree stored in this way can be used for decoding more easily than the 
encoding scheme associated with the tree. In order to decode binary code text, 
start at the root node of the tree; having scanned the first bit, go to the register 
associated with the sibling child of the root node indicated by that bit, whether 
0 or 1. Continue in this way until you arrive at a leaf node; decode the segment 
of code just scanned as the source letter assigned to that leaf, update the tree, 
return to the root node, and resume scanning. 
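
The register layout and the decoding walk can be sketched as follows. The five-register tree is a made-up example, each register is a dictionary with the fields (a) to (c) listed above, and the tree update that an adaptive decoder would perform after each letter is left out. Edge labels are given by position in `children`: index 0 is the edge labeled 0.

```python
# Hypothetical registers for a three-letter tree with code a=0, b=10, c=11.
REGISTERS = [
    {"weight": 5, "letter": "a", "children": None, "parent": 4},
    {"weight": 2, "letter": "b", "children": None, "parent": 3},
    {"weight": 1, "letter": "c", "children": None, "parent": 3},
    {"weight": 3, "letter": None, "children": (1, 2), "parent": 4},
    {"weight": 8, "letter": None, "children": (0, 3), "parent": None},
]
ROOT = 4

def decode(registers, root, bits):
    """Walk from the root, one child pointer per bit; emit a letter at each leaf."""
    out, node = [], root
    for b in bits:
        node = registers[node]["children"][int(b)]
        if registers[node]["letter"] is not None:
            out.append(registers[node]["letter"])
            node = root            # (an adaptive decoder would update the tree here)
    return out

def codeword(registers, leaf):
    """Read a leaf's code word by climbing parent pointers up to the root."""
    bits, node = [], leaf
    while registers[node]["parent"] is not None:
        parent = registers[node]["parent"]
        bits.append(str(registers[parent]["children"].index(node)))
        node = parent
    return "".join(reversed(bits))

print(decode(REGISTERS, ROOT, "010110"))  # ['a', 'b', 'c', 'a']
print(codeword(REGISTERS, 1))             # 10
```

The second function shows why no separate table of code words is needed: the parent pointers let the encoder recover any code word from the tree itself.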

Thus, in any Huffman encoding, whether adaptive or not, the Huffman tree 
can serve efficiently as the Huffman encoding scheme. Now, keeping in mind 
the sequence-of-registers form of the Huffman tree in storage, we will look at 
the efficient tree-updating algorithm due to Knuth [40], based on a mathematical 
result of Gallager [23]. First, we need to understand Gallager’s result. 

The tree resulting from an application of Huffman’s algorithm to an initial 
assignment of weights to the leaf nodes belongs to a special class of diagrams 
called binary trees.¹ It is easy to see by induction on m that a binary tree with m
leaf nodes has a total of 2m − 1 nodes altogether, and thus 2m − 2 nodes other
than the root. (See Exercise 8.2.3.) Gallager’s result is the following.


8.2.1 Theorem Suppose that T is a binary tree with m leaf nodes, m ≥ 2, with
each node u assigned a non-negative weight wt(u). Suppose that each parent is
weighted with the sum of the weights of its children. Then T is a Huffman tree
(meaning T is obtainable by some instance of Huffman’s algorithm applied to
the leaf nodes, with their weights), if and only if the 2m − 2 non-root nodes of
T can be arranged in a sequence $u_1, u_2, \dots, u_{2m-2}$ with the properties that


(a) $\mathrm{wt}(u_1) \le \mathrm{wt}(u_2) \le \cdots \le \mathrm{wt}(u_{2m-2})$ and


(b) for $1 \le k \le m-1$, $u_{2k-1}$ and $u_{2k}$ are siblings in T.


We leave the proof of this theorem as an exercise (see Exercise 8.2.4) for 
those interested. 
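
Conditions (a) and (b) are easy to test mechanically for a proposed ordering. The sketch below is our own illustration, not part of Gallager’s or Knuth’s presentation: the function name `in_gallager_order` and the representation of each non-root node as a (weight, parent identifier) pair are assumptions. Note that it checks one given ordering only; it does not search for an ordering that works.

```python
from typing import Hashable, List, Tuple


def in_gallager_order(nodes: List[Tuple[int, Hashable]]) -> bool:
    """nodes[k] = (weight, parent id) of the (k+1)-st non-root node in the proposed order."""
    weights = [w for w, _ in nodes]
    # (a) weights are non-decreasing along the sequence
    nondecreasing = all(x <= y for x, y in zip(weights, weights[1:]))
    # (b) the nodes in positions 2k-1 and 2k (1-based) share a parent
    paired = len(nodes) % 2 == 0 and all(
        nodes[k][1] == nodes[k + 1][1] for k in range(0, len(nodes), 2)
    )
    return nondecreasing and paired


# Leaves of weight 1, 1, 2: the two 1's are merged into p (weight 2), then p and the 2 under root r.
good = [(1, "p"), (1, "p"), (2, "r"), (2, "r")]

# Leaves of weight 1, 2, 2 with the two 2's merged first: not what Huffman's algorithm does.
bad = [(1, "r"), (2, "p"), (2, "p"), (4, "r")]
```

For the second tree the only weight-sorted orderings pair the leaf of weight 1 with a leaf of weight 2, and those two are not siblings, which agrees with the theorem.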

Both Gallager [23] and Knuth [40] propose to manage the Huffman tree at
each stage in adaptive Huffman encoding by ordering the nodes $u_1, \dots, u_{2m-1}$
so that the weight on $u_k$ is non-decreasing with k and so that $u_{2k-1}$ and $u_{2k}$
are siblings in the tree, for 1 ≤ k ≤ m − 1. ($u_{2m-1}$ will be the root node.) This
arrangement allows updating after a count increment in a leaf node in, at worst,
a constant times m operations (where m is the number of leaf nodes), while redoing
the whole tree from scratch requires a constant times $m^2$ operations. Plus,
you never catch a break redoing the whole tree; it always takes the same number
of operations, whereas applying the method of Knuth and Gallager sometimes
updates the tree in very few operations.

We refer to the method of Knuth and Gallager because the two methods are 
essentially the same, a fact that may not be evident from their descriptions; but 
the fact is that they always result in the same updated tree (except, possibly, for 
the leaf labels) by essentially the same steps. Gallager’s, the first on the scene, is 
the slower and more awkward to carry out. Indeed, you can think of Gallager’s 
method as a somewhat inefficient way of carrying out Knuth’s algorithm, al- 
though the historical truth is that Knuth’s algorithm is a clever improvement of 
Gallager’s method. We shall give an account of Gallager’s method here as a 
historical curiosity, and to warm the reader up for Knuth’s algorithm, but the 
reader keen on applications may skip the subsection on Gallager’s method. 


¹ The technical definition of binary trees is easy but unmemorable: a binary tree is an acyclic
connected graph in which exactly one node has degree 2 and all the other nodes have degrees 3 or 
1. You can think of a binary tree as the finite diagram obtained by starting at the root node, drawing 
two edges or branches to two sibling children of the root node, and continuing in this way, deciding 
for each new node whether it will have two children or remain a childless leaf. The finiteness 
requirement says that you cannot continue forever allowing some node or other to have children. 
In describing the method, we will identify the nodes of the tree with the
registers or locations in which information about the nodes is stored. Both
versions of the method use node interchange, which results in entire subtrees
of the current tree being lopped off and regrafted in new positions. It works
like this: suppose $u_i$ and $u_j$ are distinct nodes in a node-weighted binary tree.
To exchange these nodes, switch the node weights and the names of associated
source letters, in case one or both of $u_i$, $u_j$ are leaf nodes, between the locations.
Leave the “forward” pointers to the parents as they are, in locations $u_i$ and $u_j$.
So, effectively, the nodes have exchanged parents; but they keep their children
and other descendants, and to effect this you have to go to the registers of the
children of $u_i$ (and of $u_j$), if any, and change their forward pointers so that they
point to $u_j$ (resp. $u_i$). Also, the backward pointers, if any, in $u_i$ and $u_j$ have to
switch. (Of course, the names of the nodes are switched as well, so that what
was once referred to as $u_i$ is now $u_j$, but this has nothing to do with what is
going on in the registers.)

For example, consider the Huffman tree below, lifted from Knuth’s paper. 
Edges will substitute for pointers and the weights are indicated inside the nodes, 
as usual. The names of the nodes are beside the nodes, as well as the names of 
the associated source letters, in the case of leaf nodes. 


[Figure $(T_1)$: Huffman tree with nodes $u_1, \dots, u_{11}$; the leaf $u_2$ carries source letter $s_6$ with weight 3.]


Verify that the weight stored at $u_k$ is a non-decreasing function of k, and that
successive nodes $u_{2k-1}$, $u_{2k}$, 1 ≤ k ≤ 5, in the list $u_1, \dots, u_{10}$ are siblings.

Now, the interchange of $u_4$ and $u_5$ results in the tree $(T_2)$. This also is a
Huffman tree, with nodes in the “Gallager order” described in Theorem 8.2.1. 
It is worth noting that if you start with one Huffman tree with nodes in Gallager 
order and interchange two nodes, if the two nodes have the same weight then 
the result will be a Huffman tree with nodes in Gallager order. If the two nodes 
do not have the same weight then the resulting tree could be a Huffman tree, 
but, in any case, the nodes will not be in Gallager order; requirement (a) will be 
no longer satisfied. 


[Figure $(T_2)$: the tree $(T_1)$ after the interchange of $u_4$ and $u_5$; the leaf $u_4$ now carries source letter $s_2$, and $u_2$ still carries $s_6$.]


Note also the “subtree regrafting” feature of this interchange. The subtree
originally rooted at $u_4$ has been snipped off and regrafted into the tree, now
with $u_5$ as the root. The same is true of the subtree originally rooted at $u_5$
(it is rerooted at $u_4$), but this rerooting is not very dramatic since that subtree
consists of a single node.


8.2.1 Gallager’s method 


We suppose that the nodes $u_1, \dots, u_{2m-1}$ of the Huffman tree are stored in Gallager
order; for 1 ≤ k ≤ m − 1, we refer to $u_{2k-1}$, $u_{2k}$ as a sibling pair.

Suppose a count wt(u) in a leaf node u is to be incremented. Change its 
weight to wt(u) + 1 and inspect the next sibling pair up from the one u is in. 
If the smaller weight in this sibling pair, say on the node v, is smaller than 
wt(u) + 1 (which will be the case if and only if wt(u) = wt(v)), interchange u 
and v. In case there is a choice, interchange u with the sibling with larger index. 
Now again compare wt(u) + 1 with the weights in the next sibling pair up (from
v). Interchange if wt(u) + 1 is larger than one of these. Continue interchanging
until the incremented weight, wt(u) + 1, is, in fact, smaller than both weights
in the sibling pair beyond the node on which that weight now resides, or until
there is no sibling pair beyond that node (which will happen if and only if the
current residence of wt(u) + 1 is a child of the root node).

Suppose the incremented weight is now at node w. Next increment the
weight on the parent of w by one. If the parent of w is the root node, you are 
done. Otherwise, proceed with the parent of w as with u, interchanging until its 
incremented weight is no greater than that on either node in the next sibling pair 
up. Then increment its parent—and so on. 

For example, suppose the current Huffman tree is $T_1$, above, and $s_6$ is
scanned. The count in $u_2$ is increased to 4. The next sibling pair up from
$u_2$ is $u_3$, $u_4$, each with weight 5, so we leave location $u_2$ alone (for now; it will
soon have its forward pointers changed) and increment the weight in $u_4$ by 1;
it is now 6. This is greater than the weight in $u_5$, in the next sibling pair up, so
nodes $u_4$ and $u_5$ are interchanged. (So, in particular, nodes $u_1$ and $u_2$ are now
the children of $u_5$.) Now the weight 6 in $u_5$ is less than each weight in $u_7$ and
$u_8$, corresponding to the next sibling pair up, so increment the weight in $u_8$ (the
parent of $u_5$) by 1 up to 12. This is greater than the weight in $u_9$, so exchange
$u_8$ and $u_9$. Finally, increment the root node. The result of all this:


[Figure $(T_3)$: the updated tree; the recoverable leaf labels are $u_1$: $s_3$ (weight 2), $u_2$: $s_6$ (weight 4), $u_3$: $s_1$, $u_4$: $s_2$, and $u_6$: $s_4$ (weight 6).]


It is heartily recommended that the reader work through the steps in this example.


8.2.2 Knuth’s algorithm 


The main differences between the methods of Knuth and Gallager are that 
Knuth’s calls for node interchange before any counts are incremented, and there 
is no emphasis at all on sibling pairs. (Nor did there need to be, really, in Gal- 
lager’s method.) All node interchanges are of nodes of equal weight, so there 
need be no interchange of weights between locations. Interchanges involve only 
changes of pointers and of source letter identities of leaf nodes. 

We start with the nodes in Gallager order, $u_1, \dots, u_{2m-1}$. Suppose that the
count on a leaf node u, currently with count wt(u), is to be incremented. Look
up the list from u and find the node w furthest up the list with the same weight,
wt(u), as u, and interchange u and w. [This w will be the same node discovered
in the first stage of Gallager’s method.] Now go to the parent of w. If it is the
root node, go to the incrementation phase, described below. Otherwise, find
the node the furthest up the list from the parent of w with the same weight as
the parent of w, interchange the parent of w with that node, proceed to the next
parent, and so on.

After all the node interchange is finished, the leaf node now corresponding 
to the source letter scanned, and each node on the unique path from that node 
to the root node, have their weights increased by 1, and the update is complete. 

For example, let us again suppose that $T_1$ is the current state of the Huffman
tree, and that $s_6$ is scanned. The current count 3 in $u_2$, corresponding to $s_6$, is
the only 3 appearing, so we move to the next parent node, $u_4$; $u_5$ is the location
with greatest index (5) bearing the same weight as $u_4$ (5, by coincidence), so
nodes $u_4$ and $u_5$ are interchanged. The current picture is then $T_2$, above.

The next parent to be processed is the node $u_8$ with weight 11. The highest
indexed node with weight 11 is currently $u_9$, so $u_8$ and $u_9$ are interchanged. The
next parent is then the root node, so we are done. Now the counts in locations
$u_2$, $u_5$, $u_9$, and $u_{11}$ are increased by 1, and we are ready to scan the next letter.
Notice that the tree arrived at after incrementation is $T_3$, the tree arrived at by
Gallager’s method, and that all the interchanges that took place were between 
the same pairs of nodes, in the same order, as in Gallager’s method. It is al- 
ways thus, if all weights on the nodes are positive. We leave verification of this 
statement to the reader. 

There is a slight problem with Knuth’s algorithm as described when zero 
counts are allowed on the leaf nodes; it can happen that the $u_i$ with largest index
i, with the same weight as the node v you are currently looking at, is $u_{2m-1}$, the
root node. In this eventuality, simply interchange v with $u_{2m-2}$ and proceed to
the incrementation stage. By an accident of logic and reference, because of the 
business about sibling pairs, this provision is built into Gallager’s method. 

You may be wondering about labeling of the edges of the Huffman tree
with 0 or 1, in the Knuth-Gallager procedure. The easiest rule is that if a parent
u has children $u_{2k-1}$, $u_{2k}$ in the current Gallager ordering of the nodes, then let
the edge from u to $u_{2k}$ be labeled 0, and the edge from u to $u_{2k-1}$ be labeled
1. In practice, it might be convenient to have a couple of bits in the register
corresponding to u reserved for signaling this labeling, even though it can be
figured out from the order of the sibling children in the current Gallager ordering
of the nodes.


[Figure $(T_4)$: the initial tree for the exercises, with every leaf count equal to 1; $u_1$ is the leaf for $s_6$.]


The Knuth-Gallager procedure requires some sort of initialization conven- 
tion, because whether the initial counts are set at 1 or 0, there are many choices 
as to the Gallager order of the nodes at the outset. In the exercise set to follow, 
S = $\{s_1, \dots, s_6\}$, the initial counts are set at 1, and the initial Gallager ordering is
indicated by $(T_4)$. Please note that by the edge-labeling convention mentioned
above, the edge from $u_{11}$ to $u_{10}$ will be labeled 0.


Exercises 8.2 


1. With the initialization and the edge labeling conventions mentioned above, 
do encoding Exercise 8.1.1 with Knuth’s algorithm governing the tree up- 
dating. 

2. Similarly, do decoding Exercise 8.1.3 with Knuth’s algorithm. 


3. (a) Show that in a binary tree, there must be at least two leaf nodes that 
are siblings. [If a binary tree is defined as a tree formed by a certain 
process, this proposition is evident; the last two children formed will 
be siblings and leaf nodes. 

Here is another proof. Take a node a greatest distance from the root 
node. It and its sibling must be leaf nodes. Why?] 

(b) Show that a binary tree with m leaf nodes has 2m — 1 nodes, total. [Use 
(a) and go by induction on m.] 


4. Prove Theorem 8.2.1. You might follow the following program. 


(a) If T is a binary tree generated by applying Huffman’s algorithm to the 
non-negatively weighted leaf nodes, then the two smallest node weights 
appear on sibling leaf nodes, by appeal to the procedure of formation. By a 
similar appeal, the tree obtained by deleting those two leaf nodes, thereby 
making their parent a leaf, is also a Huffman tree. 


The proof that the nodes of T can be put in Gallager order is now straight- 
forward, by induction on the number of leaf nodes of T. 


(b) Suppose T is a binary tree with non-negatively weighted nodes, with 
each parent weighted with the sum of the weights of its children. Suppose 
the nodes of T can be put in Gallager order, $u_1, u_2, \dots, u_{2m-1}$. If all the
weights on nodes are positive, then $u_1$ and $u_2$, siblings, must be leaf nodes.
(Why?) If zero appears as a weight, then it is possible that one of $u_1$, $u_2$ is
a parent, say of $u_{2k-1}$, $u_{2k}$, k ≥ 2, but only if the weights on $u_{2k-1}$ and $u_{2k}$
are both zero (verify this assertion, under the assumptions), in which case
all the weights on $u_1, \dots, u_{2k}$ are zero. Switch the sibling pairs $u_1$, $u_2$ and
$u_{2k-1}$, $u_{2k}$ in the ordering; the new ordering is still a Gallager ordering.
Switch again, if necessary, and continue switching until the first two nodes 
in the ordering are sibling leaf nodes. (How can you be sure that all this 
switching will come to an end with the desired result?) Now consider the 
tree T’ obtained from T by deleting u1,u2. Draw the conclusion that T is 
a Huffman tree, by induction on the number of leaf nodes. (The important 
formalities are left to you.) 


8.3 Adaptive arithmetic coding 


If you understand the general procedure in arithmetic coding, the main idea in 
adaptive arithmetic coding (including higher-order adaptive arithmetic coding) 
is quite straightforward. Counts of the source letters are maintained in both the 
encoding and in the decoding; in the case of higher-order adaptive arithmetic 
coding, counts are maintained in the different contexts. Whatever the current 
interval is, it is subdivided into subintervals, corresponding to the source letters, 
with lengths proportional (or, in the case of integer intervals, approximately 
proportional) to the current source letter counts. It is simplest to maintain the 
order $s_1, \dots, s_m$ of the source letters, as in higher-order arithmetic coding. That
is, there is usually no good reason to rearrange the source letters, and thus the 
order of the next subintervals, so that the counts of the letters and the lengths of 
the subintervals are in non-increasing order. 

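Explicitly, if S = $\{s_1, \dots, s_m\}$, if the current interval $A = A(s_{i_1} \cdots s_{i_k})$ (possibly
the result of rescaling) has left-hand endpoint $\alpha$ and length $\ell$, and if
$s_1, \dots, s_m$ have counts $c_1, \dots, c_m$, respectively, then the endpoints of the subintervals
$A(s_{i_1} \cdots s_{i_k} s_1), \dots, A(s_{i_1} \cdots s_{i_k} s_m)$ will be

$$\alpha,\quad \alpha + \frac{c_1}{C}\ell,\quad \alpha + \frac{c_1 + c_2}{C}\ell,\quad \dots,\quad \alpha + \frac{c_1 + \cdots + c_{m-1}}{C}\ell,\quad \alpha + \ell,$$

where $C = \sum_{j=1}^{m} c_j$.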

Let’s try an example of adaptive dfwld encoding in the fashion of Sec- 
tion 6.1. We take S = {a,b,c,d}, initial counts all 1, and we will encode 
bbca. At the outset, the intervals A(a), A(b), A(c), and A(d) are, respectively,
[0, .25), [.25, .5), [.5, .75), and [.75, 1). After the first b is scanned, the
counts of a,b,c,d become 1,2, 1, 1, respectively, so A(b) = [.25, .5) is broken 
into [.25, .3), [.3,.4), [.4,.45), and [.45,.5). When the next b is scanned, we 
are in interval A (bb) = [.3, .4), and rescaling is possible. The resulting interval 
will be divided into subintervals with lengths in the proportions 1,3, 1, 1, which 
will make the arithmetic annoying. Here is the encoding table, similar to those
in Section 6.1, with an extra column for the counts. As usual, $\alpha$ is the left-hand
endpoint and $\ell$ is the length of the current interval.


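Counts of      Next letter
a, b, c, d     or rescale      α        ℓ         Code so far

1, 1, 1, 1     (start)         0        1
1, 2, 1, 1     b               1/4      1/4
1, 3, 1, 1     b               3/10     1/10
1, 3, 1, 1     rescale         3/5      1/5       0
1, 3, 1, 1     rescale         1/5      2/5       01
1, 3, 2, 1     c               7/15     1/15      01
2, 3, 2, 1     a               7/15     1/105     01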


The last interval is [7/15, 7/15 + 1/105) = [7/15, 10/21), the dfwld in which is
$(.01111)_2$, easily obtained by carrying out the binary expansions of the endpoints
until they disagree. We add the bits of the binary expansion of the dfwld
to the “code so far” to obtain 0101111 as the code for bbca.
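
For readers who would rather let a machine do the annoying arithmetic, here is a minimal sketch in Python using exact fractions. The function names are ours, and two choices are assumptions rather than the procedure of Section 6.1: the sketch rescales as early as possible after every letter (doubling whenever the interval lies in [0, 1/2) or in [1/2, 1)), and it takes the dfwld to be the dyadic fraction $k/2^n$ in the interval with n as small as possible. With those choices it arrives at the same code word for bbca.

```python
from fractions import Fraction
from math import ceil

HALF = Fraction(1, 2)


def dfwld_bits(alpha: Fraction, length: Fraction) -> str:
    """Bits of the dyadic fraction k/2^n in [alpha, alpha + length) with n least."""
    n = 0
    while True:
        k = ceil(alpha * 2**n)                   # smallest k with k/2^n >= alpha
        if Fraction(k, 2**n) < alpha + length:
            return format(k, "b").zfill(n) if n else ""
        n += 1


def adaptive_encode(text: str, alphabet: str) -> str:
    counts = {s: 1 for s in alphabet}            # all counts start at 1
    alpha, length, code = Fraction(0), Fraction(1), ""
    for s in text:
        total = sum(counts.values())
        below = sum(counts[t] for t in alphabet[: alphabet.index(s)])
        alpha += length * Fraction(below, total)     # left endpoint of A(...s)
        length *= Fraction(counts[s], total)         # its length
        counts[s] += 1                               # update after coding the letter
        while alpha + length <= HALF or alpha >= HALF:
            if alpha >= HALF:                        # interval inside [1/2, 1)
                code += "1"
                alpha = 2 * alpha - 1
            else:                                    # interval inside [0, 1/2)
                code += "0"
                alpha = 2 * alpha
            length *= 2
    return code + dfwld_bits(alpha, length)


code = adaptive_encode("bbca", "abcd")
```

Because rescaling only applies the same doubling map to the interval and to the bits still to come, doing it earlier or later than in the table does not change the final code word in this example.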

Perhaps we are making too much of this. The point is that in both the encoding
and the decoding, however they are carried out, the role of the relative
source frequencies $f_j$ (or, in the higher-order cases, of the conditional context
probabilities $f(i_1, \dots, i_k, j)/f(i_1, \dots, i_k)$) is taken over by the constantly
changing ratios $c_j/C$. (In the higher-order cases, these are relative to the context.)
Whatever you did in Section 6.1, you are doing the same in adaptive
arithmetic coding, with the constantly updated $c_j/C$ playing the roles of the $f_j$.

The same is true in the more practical setting of Section 6.4, but here problems
arise. Suppose we replace the continuum [0, 1) and exact computation
by an integer interval [0, M) = {0, 1, ..., M − 1}, and approximate computation.
If the relative source frequencies $f_1, \dots, f_m$ are known ahead of time, M
can be chosen so that, with “underflow condition” rescaling, every letter will
always have a non-empty interval as a subinterval of the current interval. But
with adaptive arithmetic coding, we know not what horrors await. M is chosen
ahead of time, and it may well be that for some j, $c_j/C$ will fall so low that the
subinterval allocated to $s_j$ in the current interval vanishes. Another, related danger
is that when Gallager’s idea of allowing for statistical change in the source
text by multiplying the counts by a fraction and rounding down (see Section 
8.1) is carried out, some source letter will get count zero, and thus be allocated 
an empty subinterval in the current interval. 

One simple way to deal with both problems is to modify Gallager’s method 
by rounding up instead of rounding down, so that each source letter gets a pos- 
itive count, and to agree, between the encoder and the decoder, that the frac- 
tionalizing procedure will be repeated, if necessary, until the new count sum C 
satisfies C ≤ M/4 + 2. Recall from Section 6.4 that this inequality guarantees
that every source letter will be allocated a non-empty integer interval, in the 
subdivision of the current interval in the algorithm of that section. 
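
The agreement between encoder and decoder just described might be coded as follows. This is a sketch under our own assumptions: the name `fractionalize` is hypothetical, the fraction is passed as `num/den` (1/2 by default), M is taken to be a multiple of 4, and a guard stops the loop if rounding up leaves the counts unchanged, a case the text does not discuss.

```python
from typing import Dict


def fractionalize(counts: Dict[str, int], M: int, num: int = 1, den: int = 2) -> Dict[str, int]:
    """Multiply the counts by num/den, rounding up, until their sum is at most M/4 + 2."""
    limit = M // 4 + 2
    counts = dict(counts)                        # leave the caller's dictionary alone
    while sum(counts.values()) > limit:
        scaled = {s: -((-c * num) // den) for s, c in counts.items()}   # ceiling division
        if scaled == counts:                     # cannot shrink any further
            break
        counts = scaled
    return counts


small = fractionalize({"a": 5, "b": 5, "EOF": 1}, 32)
large = fractionalize({"a": 30, "b": 9, "EOF": 1}, 32)
```

Rounding up keeps every count at 1 or more, so no letter is ever handed an empty subinterval on account of the scaling.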

Adaptive Huffman and adaptive arithmetic encoding have been regarded as 
approximately equivalent in effectiveness and cost, in the past. Current gossip 
has it that adaptive arithmetic now has an edge over adaptive Huffman encod- 
ing, because implementation of arithmetic coding has been improved lately, 
while tree maintenance techniques for adaptive Huffman encoding remain approximately
where Knuth left them. But the pace of technological advance is
swift, so adaptive Huffman encoding may be the less costly of the
two by the time you read this. Of course, many hold the view that arithmetic
encoding of any sort has an insurmountable theoretical advantage over the cor- 
responding Huffman encoding, but it is not clear that this theoretical advantage 
persists in practice. 


Exercises 8.3 


1. Redo Exercise 6.1.1 adaptively, with all counts initially set at 1. 


2. This is about adaptive arithmetic coding using the method of Section 6.4 
with an integer interval [0, M) = {0, ..., M − 1}. Suppose that M = 32.
Suppose the source letters are a, b, and EOF, and all source letters start with 
a count of 1. Suppose that the Gallager fraction by which the letter counts 
will be occasionally multiplied is 1/2, with rounding up, as suggested in 
the text above. Suppose that this fractionalizing of the counts will occur 
whenever the count sum rises to 11 = (M/4+2)+1. Give the current 
counts of a, b, and EOF after each source letter is read, if the source stream 
is baabbbbabbaabaaabaaaaEOF. 




8.4 Interval and recency rank encoding 


Both encoding methods referred to in the title of this section were introduced 
by Elias [16], although he gives credit for the independent discovery of recency 
rank encoding to Bentley, Sleator, Tarjan, and Wei.

Both methods are lossless, adaptive methods for encoding text over a source 
alphabet S = $\{s_1, \dots, s_m\}$.


8.4.1 Interval encoding 


In interval encoding we start with an infinite set C = $\{u_0, u_1, \dots\}$ of binary
words, satisfying the prefix condition and $\mathrm{lgth}(u_0) \le \mathrm{lgth}(u_1) \le \cdots$. A letter $s_j$
occurring in the source text is encoded by $u_i$, where i is the number of source letters
between this occurrence of $s_j$ and its last occurrence, in previously scanned
source text. We get started by imagining the source text to be preceded by the
string $s_1 \cdots s_m$.

For example, let us take C = {0, 10, 110,...} and consider the source text 
(*) in Section 8.1.2, 


$s_3 s_3 s_2 s_6 s_1 s_1 s_2 s_6 s_5 s_1 s_3 s_3 s_5 s_6 s_1 s_2 s_2 \cdots$    (*)


The encoder (and the decoder, as well) knows that the source alphabet is $\{s_1, \dots, s_6\}$.
The first $s_3$ in the source text is encoded by $u_3$ = 1110, because there are
three letters, namely $s_4$, $s_5$, and $s_6$, between it and its imaginary first occurrence,
in the imagined block $s_1 \cdots s_6$ preceding the real source text. The next $s_3$ is
encoded with $u_0$ = 0. The $s_2$ following is encoded with $u_6$ = 1111110. (Why?)
The code for the first 6 letters of (*), $s_3 s_3 s_2 s_6 s_1 s_1$, will be 1110011111101110$1^9$00
(with $1^9$ standing for nine ones). Notice that if $s_4$ ever shows up in this source
text, its first occurrence will be represented by a frightfully long code word.
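
The encoding rule is short enough to sketch in a few lines of Python. The names `interval_encode` and `unary` are our own, and keeping the whole history in a list is a choice made for clarity, not for speed.

```python
from typing import Callable, List, Sequence


def unary(k: int) -> str:
    """The code word 1^k 0 of the example set C = {0, 10, 110, ...}."""
    return "1" * k + "0"


def interval_encode(text: Sequence[str], alphabet: Sequence[str],
                    codeword: Callable[[int], str]) -> List[str]:
    history = list(alphabet)                     # the imagined preamble s_1 ... s_m
    out = []
    for s in text:
        # number of letters strictly between this occurrence of s and its last one
        gap = history[::-1].index(s)
        out.append(codeword(gap))
        history.append(s)
    return out


S = ["s1", "s2", "s3", "s4", "s5", "s6"]
words = interval_encode(["s3", "s3", "s2", "s6", "s1", "s1"], S, unary)
```

Here `words` holds the six code words for the first six letters of (*), one per source letter, so that their concatenation is the code text given above.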

In fact, although interval encoding intuitively seems elegant and efficient, 
it is rather disappointing in practice, with regard to compression. The source of 
the infelicity is the choice C of a set of code words, chosen without reference 
to any particular properties of the source text. You do not need to know any 
properties of the source text to do adaptive Huffman or arithmetic encoding, but 
the encoding will reflect statistical properties of the source text discovered or
collected as the encoding proceeds. In interval encoding, you are stuck with 
the set C chosen at the outset. Of course, there are situations involving com- 
munications in which it is a good thing for everybody in the conversation to be 
using the same code word set, so this weakness of interval encoding can also be 
a strength. 

The C we used for the example above, in which $u_k = 1^k 0$, k = 0, 1, 2, ...,
is a particularly bad choice. It can be shown (see Exercise 8.4.3) that for any
zeroth-order source with alphabet S = $\{s_1, \dots, s_m\}$, the average length of a code
word replacing a source letter in interval encoding of the source text, using
the set C of code words above, will be m. This compares very unfavorably
with $1 + H(S) \le 1 + \log_2(m)$ (see Chapters 2 and 5), an upper bound on the
average length of a code word replacing a source letter if Huffman encoding 
were possible, and thus an upper bound on the (long term) average code word 
length if adaptive Huffman encoding is used. 

Elias [16] proposes two infinite code word sets, that we will refer to hereafter
as $C_1 = \{u_0, u_1, u_2, \dots\}$ and $C_2 = \{v_0, v_1, v_2, \dots\}$, which do a much better
job of compression when used in interval encoding than C, above, although they
still do not give an average code word length close to H(S), for a zeroth-order
source S. The code word $u_k$ is formed by following a string of $\lfloor \log_2(k+1) \rfloor$ zeroes
by the binary expansion of k + 1, which will be of length $1 + \lfloor \log_2(k+1) \rfloor$.
Thus $\mathrm{lgth}(u_k) = 1 + 2\lfloor \log_2(k+1) \rfloor$. The first few $u_k$’s are as follows:


$u_0$ = 1        $u_4$ = 00101
$u_1$ = 010      $u_5$ = 00110
$u_2$ = 011      $u_6$ = 00111
$u_3$ = 00100    $u_7$ = 0001000


The code word $v_k$ is formed by following $u_{g(k)}$, where $g(k) = \lfloor \log_2(k+1) \rfloor$,
by the binary expansion of k + 1. Thus


$v_0$ = 11        $v_4$ = 011101
$v_1$ = 01010     $v_5$ = 011110
$v_2$ = 01011     $v_6$ = 011111
$v_3$ = 011100    $v_7$ = 001001000


(Verify!) It may appear that the $v_k$ are longer than the $u_k$, and so it is, in the
early going, but

$$\mathrm{lgth}(v_k) = 1 + \lfloor \log_2(k+1) \rfloor + \mathrm{lgth}(u_{g(k)}) = 2 + \lfloor \log_2(k+1) \rfloor + 2\lfloor \log_2(1 + \lfloor \log_2(k+1) \rfloor) \rfloor,$$

which is asymptotically about half of $\mathrm{lgth}(u_k) = 1 + 2\lfloor \log_2(k+1) \rfloor$, as k → ∞.
It is left to the reader to verify that the sets $C_1$ and $C_2$ are prefix-free; i.e.,
they satisfy the prefix condition.
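
The two rules of formation translate directly into code. In the Python sketch below the function names `u` and `v` simply mirror the notation of the text; the only fact used is that the binary expansion of k + 1 has $1 + \lfloor \log_2(k+1) \rfloor$ bits.

```python
def u(k: int) -> str:
    """Code word u_k of C_1: floor(log2(k+1)) zeroes, then k+1 in binary."""
    b = format(k + 1, "b")
    return "0" * (len(b) - 1) + b


def v(k: int) -> str:
    """Code word v_k of C_2: u_{g(k)} with g(k) = floor(log2(k+1)), then k+1 in binary."""
    b = format(k + 1, "b")
    return u(len(b) - 1) + b


first_u = [u(k) for k in range(8)]
first_v = [v(k) for k in range(8)]
```

The lists `first_u` and `first_v` reproduce the two tables above, which is one way of carrying out the "Verify!".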


It can be shown (see Exercise 8.4.4) that in interval encoding using $C_1$ on
text from a zeroth-order source S, the average length of a code word replacing
a source letter is no greater than 1 + 2H(S); using $C_2$, that average is no greater
than

$$2 + H(S) + 2\sum_{j=1}^{m} f_j \log_2\bigl(1 + \log_2(1/f_j)\bigr),$$

where the $f_j$ are the relative frequencies of the source letters. If m and H(S) are
large, the latter is usually the smaller of the two upper bounds, but for the small
source alphabets we use for examples, $C_1$ is the superior set of code words.

If you do Exercise 8.4.4, you will see that, for m large and the $f_j$ small,
the upper bounds on the average code word lengths in the paragraph preceding 
are not very pessimistic; that is, the true average length of a code word replac- 
ing a source letter is not far from the upper bound given. We leave the precise 
analysis necessary to demonstrate this to the ambitious; the point is that by interval
encoding using $C_1$ or $C_2$, we cannot hope for the compression achievable
with adaptive Huffman or arithmetic coding applied to text from a zeroth-order 
source. 

On the other hand, the upper bounds mentioned above say that interval 
encoding using $C_1$ or $C_2$ results in code text, at worst, no more than twice as
long as the code text from Huffman or arithmetic encoding, on the average, 
when applied to text from a perfect zeroth-order source; and there are offsetting 
advantages, such as a common, system-wide code word set, and great ease and 
speed in encoding and decoding. 

The decoding procedure should be clear, but just to be sure that we under- 
stand it, let’s decode 


011001101001110100100001000100100, 


the result of applying interval encoding using C; to a short passage of source 
text over S = $\{s_1, s_2, s_3, s_4, s_5, s_6\}$. Recall that it is imagined that the actual
source text is preceded by $s_1 \cdots s_6$, to get things started.

Scanning the string above from left to right, we first recognize $u_2$ = 011.
(All $u_k$ in the string above are among $u_0, \dots, u_7$, listed above. We will have a
few words to say below about a more systematic way of recognizing the $u_k$.)
Thus the first source letter is $s_4$. Do you see why? Because $s_4$ is the choice that
puts two letters between it and its previous occurrence in the preliminary block
$s_1 \cdots s_6$.

After $u_2$, we next recognize $u_5$ = 00110. Counting back six places, we
come to $s_2$, and that is the next source letter. The next code word is $u_0$ = 1, so
$s_2$ is repeated. Continuing in this way, you should decode the given string as
$s_4 s_2 s_2 s_3 s_2 s_3 s_5 s_5 s_2$.

Given the rules of formation of the $u_k$, it is quite easy to program so that the
$u_k$ can be recognized by reading left to right, without checking any lists. You
count the number of zeroes until the first 1 in the word being scanned. If there
are t zeroes, then the next t + 1 bits give the binary expansion of k + 1, which is
the number of places you will count back through the source text so far to find 
the next source letter to be decoded. 
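
The left-to-right recognition just described might be programmed as in the following Python sketch. The name `interval_decode_c1` is ours, and the sketch assumes the input really is a concatenation of code words from $C_1$; a malformed string would raise an error instead of being reported gracefully.

```python
from typing import List, Sequence


def interval_decode_c1(bits: str, alphabet: Sequence[str]) -> List[str]:
    history = list(alphabet)                     # the imagined preamble s_1 ... s_m
    start = len(history)
    pos = 0
    while pos < len(bits):
        t = 0
        while bits[pos + t] == "0":              # count zeroes up to the first 1
            t += 1
        back = int(bits[pos + t: pos + 2 * t + 1], 2)    # the next t+1 bits give k+1
        history.append(history[-back])           # count back k+1 places
        pos += 2 * t + 1
    return history[start:]


S = ["s1", "s2", "s3", "s4", "s5", "s6"]
decoded = interval_decode_c1("011001101001110100100001000100100", S)
```

Applied to the string decoded by hand above, the sketch returns the same nine source letters.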


8.4.2 Recency rank encoding


Recency rank encoding is quite similar to interval encoding—it applies to the 
same situations and shares some of the same advantages—but requires only a 
finite prefix-free set C = $\{u_0, \dots, u_{m-1}\}$ of m = |S| code words. Given C, an
occurrence of a source letter $s_j$ in the source text is encoded by $u_k$, where k is
the number of distinct source letters that have appeared in the source text since
the last appearance of $s_j$. As in interval encoding, we pretend that the source
text is preceded by $s_1 \cdots s_m$, to get things started.

For example, with S = $\{s_1, \dots, s_6\}$ and C = {0, 10, 110, 1110, 11110,
111110}, the source text (*) presented earlier in this section will be encoded


($s_1 s_2 s_3 s_4 s_5 s_6$)  1110 0 11110 110 111110 0 110 110 11110 1110 11110 ⋯


(We leave the rest of the encoding as an exercise.) Note that $s_4$, when it first
occurs, if ever, will be encoded 111110. 
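
One convenient way to carry out recency rank encoding, sketched below in Python, is to keep the source letters in a list ordered by recency, most recent first; the position of a letter in that list is then the number of distinct letters seen since its last appearance. The list-based bookkeeping and the name `recency_rank_encode` are our own choices, not something prescribed by the method.

```python
from typing import List, Sequence


def recency_rank_encode(text: Sequence[str], alphabet: Sequence[str],
                        codewords: Sequence[str]) -> List[str]:
    recent = list(reversed(alphabet))            # after the preamble, s_m is the most recent
    out = []
    for s in text:
        k = recent.index(s)                      # distinct letters since s last appeared
        out.append(codewords[k])
        recent.insert(0, recent.pop(k))          # s becomes the most recent letter
    return out


S = ["s1", "s2", "s3", "s4", "s5", "s6"]
C = ["0", "10", "110", "1110", "11110", "111110"]
text = ["s3", "s3", "s2", "s6", "s1", "s1", "s2", "s6", "s5", "s1", "s3"]
words = recency_rank_encode(text, S, C)
```

For the first eleven letters of (*), `words` agrees with the eleven code words displayed above. The decoder can keep the same list and read the letter off at position k.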

Clearly recency rank encoding shares with interval encoding the advantage 
of a common code word set to be used by all in a communication environment. 
Clearly recency rank encoding will compress better than interval encoding, for 
any reasonable choice of the finite code word set C. (The rigorous analysis 
of recency rank encoding in this respect remains to be done, but the point is 
that the number of distinct letters in a given block of source text is certainly no 
greater than the length of the block.) In fact, the only advantage that interval 
encoding has over recency rank encoding is in the speed and ease of encoding 
and decoding: clearly it is a little more trouble to count the number of distinct 
symbols in a block of symbols, than just to count the length of the block. And in 
recency rank decoding, having scanned $u_k$, 0 ≤ k ≤ m − 1, you have to go back
into the source string decoded so far until you come to the (k + 1)st different 
symbol, and clearly this involves some sorting and checking and a good deal 
more trouble than just counting back k + 1 places, as in interval decoding. But 
when the amount of trouble involved in recency rank encoding and decoding 
is compared to the corresponding difficulties in adaptive Huffman or arithmetic 
coding, that trouble does not seem like much. 

We are indebted to Greg Hanks, a student, for proposing the following: 
while encoding either by the interval method or by recency rank, keep counts 
of the original source letters and of the code words (the $u_k$) and then, after the
source text has been encoded, recode the whole thing by zeroth-order Huffman 
replacement of S or of C, whichever has the lesser entropy; the entropies will be 
calculated from the relative frequencies that arise from those counts you kept. 
Or, you could recode arithmetically using those relative frequencies—again, the 
recoding is applied either to the original source text or to the code text, regarded 
as a string of $u_k$’s, depending on which alphabet has lower zeroth-order entropy.

Now, this procedure defaces the speedy online character of the interval and 
recency rank methods; and, anyway, don’t we do adaptive encoding to avoid 
advance statistical study of the source text? But it is an interesting idea, all the
same, and merits some experimentation to see if any significant compression is 
achievable by such recoding. 

Further, such experimentation should provide an interesting test of faith. 
Recall the discussion at the end of Section 6.2; it is widely believed that no 
lossless “zeroth-order” method, whatever that may mean, can encode source 
text over a source alphabet S in fewer than H(S) bits per source letter, on av- 
erage, with H(S) denoting the zeroth-order entropy of the source. Now, if the 
source text is encoded by the interval or the recency rank method, each source 
letter s has been replaced by a code word u belonging to a set C of code words; 
we can regard C as a source alphabet now, and the encoded text as a new source 
text. According to the faith about entropy, and by the fact that you can encode 
text in entropy-plus-epsilon bits per symbol by indisputably zeroth-order lossless
methods (see Sections 6.2 and 5.4), either (a) it must be that H(C) ≥ H(S);
otherwise, if H(C) < H(S), we could encode the text over C, and thus the orig- 
inal source text, in fewer than H(S) bits per symbol; or (b) the coding method 
is not zeroth-order. There is a good argument for this latter assertion, because 
in both interval and recency rank encoding, the encoding of each letter s has 
something to do with the context—either the number of letters since s’s last ap- 
pearance, or the number of different letters since that appearance. Perhaps this 
objection is sufficient to preserve the faith. In any case, it would be interesting 
to see if we get H(C) < H(S) in many plausible “real” situations. When we do, 
Hanks’ suggestion provides a way of achieving better compression than could 
be had by plain zeroth-order Huffman or arithmetic coding of the source text, 
were the relative source letter frequencies known. 

In the silly situation considered at the end of Section 7.4, where S = {a,b, 
c,d} and the source text consists of abcd repeated over and over, all relative 
frequencies are 1/4, so the zeroth-order entropy is H(S) = log_2 4 = 2. Encoding
either by interval or by recency rank, every source letter gets replaced by u_3 ∈ C;
thus the entropy of C, considered as the source alphabet of the resulting text,
is H(C) = 0. Using Shannon’s trick, as described in Section 5.4, and encoding
blocks of N of those u_3’s by a single bit, we can encode the original text at the
rate of 1/N bits per original source letter.
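
The comparison of H(S) with H(C) is easy to try by machine. The following Python sketch is only an illustration: the function names, the start-up list passed to recency_ranks, and the start-up prefix passed to intervals are assumptions made here, not conventions taken from the text. Each source letter is replaced by the index k of its code word u_k, and the zeroth-order entropy is computed from the resulting relative frequencies.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Zeroth-order entropy, in bits per symbol, from relative frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum(c / n * log2(n / c) for c in counts.values())

def recency_ranks(text, initial):
    """Replace each letter by its recency rank k (standing for u_k).

    `initial` lists the alphabet, most recently seen letter first; this
    start-up convention is an assumption of the sketch.
    """
    recent = list(initial)
    ranks = []
    for s in text:
        k = recent.index(s)              # distinct letters seen since the last s
        ranks.append(k)
        recent.insert(0, recent.pop(k))  # s is now the most recent letter
    return ranks

def intervals(text, prefix):
    """Replace each letter by k, the number of letters since its last occurrence.

    `prefix` is an assumed start-up string containing every letter, so that
    each letter of `text` has a previous occurrence.
    """
    full = prefix + text
    out = []
    for i in range(len(prefix), len(full)):
        j = full.rindex(full[i], 0, i)   # position of the previous occurrence
        out.append(i - j - 1)
    return out

text = "abcd" * 50
print(entropy(text))                          # H(S)
print(entropy(recency_ranks(text, "dcba")))   # H(C) after recency rank encoding
print(entropy(intervals(text, "abcd")))       # H(C) after interval encoding
```

With these start-up choices every letter of the repeated abcd text is replaced by u_3 under both methods, so both computed values of H(C) are 0 while H(S) is 2.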

True believers (and we are among them, we just wonder what it is we be- 
lieve) will say that this is evidence either that this is a lousy example, or that 
interval and recency rank encoding are not zeroth-order methods. We tend to 
the latter view, without absolutely ruling out the former. 

One last comment; should it happen that experiment shows that we fre- 
quently have H(C) < H(S) after interval or recency rank encoding, then the 
idea of Greg Hanks might profitably be put into practice with preservation of 
the online, no-second-pass character of those coding methods, simply by imme- 
diately following the interval or recency rank encoding with adaptive Huffman 
or arithmetic coding, applied to the symbol set C = {u_0, u_1, u_2, ...} of the first
encoding. 
Exercises 8.4


1. (a) Encode source sequence (*), above, by interval encoding using the set
C1 of code words defined in the text.
(b) Complete the encoding of (*) by recency rank encoding, using C = 
{0, 10, 110, 1110, 11110, 111110}. 


2. (a) Supposing S = {s1, ..., s6}, decode
001010100101001 1000010000101 


assuming the encoding was by the interval method, using C1. 
(b) Supposing S = {s1, ..., s6}, decode


11101011100111101101011100011110, 


assuming the encoding was by recency rank, using C = {0, 10, 110,
1110, 11110, 111110}.


3. Before getting to the question, we need some observations. 


(i) Recall the formula for the sum of a geometric series: $\sum_{k=0}^{\infty} p^k =
(1-p)^{-1}$, for $|p| < 1$.

(ii) Differentiating both sides of the equation in (i) with respect to p, we
obtain $\sum_{k=1}^{\infty} k p^{k-1} = (1-p)^{-2}$.

(iii) If a symbol s from a perfect zeroth-order source occurs in the source
text (randomly and independently of all other occurrences) with relative
frequency f, then, starting from any point in the source text and going
either forward or backward, assuming the source text extends infinitely in
both directions, the probability of reading through exactly k letters before
coming to the first occurrence of s (at the (k + 1)st place scanned) is
$f(1-f)^k$. (Thus the average gap between occurrences of s in the source text is
$\sum_{k=0}^{\infty} k f (1-f)^k = f(1-f) \sum_{k=1}^{\infty} k (1-f)^{k-1} = f(1-f) f^{-2} = f^{-1} - 1$,
using (ii). To put it another way, s occurs on average once every 1/f letters,
which agrees with intuition, since f is the relative frequency of s.)


(iv) Suppose that $S = \{s_1, \ldots, s_m\}$ is the alphabet of a perfect zeroth-order
source, with $s_j$ having relative frequency $f_j$, $1 \le j \le m$. Suppose the
source text is encoded by the interval method, using some prefix-free set
$C = \{w_0, w_1, \ldots\}$ of code words. Then the average length of a code word
replacing $s_j$ will be $\ell_j = \sum_{k=0}^{\infty} f_j (1-f_j)^k \,\mathrm{lgth}(w_k)$, so the average length
of a code word replacing a source letter will be $\ell = \sum_{j=1}^{m} f_j \ell_j =
\sum_{j=1}^{m} f_j^2 \sum_{k=0}^{\infty} (1-f_j)^k \,\mathrm{lgth}(w_k)$.

Finally, the problem. Show that for any zeroth-order source, in interval 


encoding using C = {0, 10, 110, ...}, the average length of a code word
replacing a source letter will be m = |S|.


4. To get at the average length of a code word replacing a source letter in
interval encoding using the code word sets C1 or C2, we need to recall
something about concave (some say, concave down) functions, which played a
role in the proof of Theorem 5.6.3. If h is a concave function defined on
an interval, $x_1, x_2, \ldots$ are points in that interval, and $\lambda_1, \lambda_2, \ldots$ are
nonnegative numbers summing to 1, then $\sum_i \lambda_i h(x_i) \le h(\sum_i \lambda_i x_i)$. (In some
treatments, this is true by the definition of concave functions. Whether
definition or theorem, we take it as a given fact.) If h is continuous as well,
then this inequality holds for infinite sums, provided $\sum_i \lambda_i x_i$ converges.


(a) Show that if text from a zeroth-order source with alphabet $S = \{s_1,
\ldots, s_m\}$ and relative source frequencies $f_1, \ldots, f_m$ is encoded by the
interval method using C1, then the average length of a code word
replacing a source letter is $\le 1 + 2H(S)$, where, as usual, $H(S) =
\sum_{j=1}^{m} f_j \log_2(1/f_j)$. [Use (iv) in problem 3, above, and the fact that
$\log_2$ is concave.]

(b) Show that if C2 is used in interval encoding, the average number of
bits per source letter is $\le 2 + H(S) + 2 \sum_{j=1}^{m} f_j \log_2(1 + \log_2(1/f_j))$.
[Verify that $h(x) = \log_2(1 + \log_2 x)$ is concave on $[1, \infty)$ by taking its
second derivative. Proceed as in (a).]
Dictionary Methods


In the previous chapters, lossless compression was obtained by using a proba- 
bility model to drive a coder. Dictionary methods use a list of phrases (the dic- 
tionary), which hopefully includes many of the phrases in the source, to replace 
source fragments by pointers to the list. Compression occurs if the pointers re- 
quire less space than the corresponding fragments. (Of course, the method of 
passing the dictionary must also be considered.) 

In many ways, dictionary methods are easier to understand than probabilis- 
tic methods. At the simplest level, several (fixed) specialized dictionaries could 
be made available to both the coder and decoder. For text in English, a few 
thousand of the most commonly used words could serve as the dictionary; if the 
source consisted of code in some computer language such as C, then a list of 
the keywords and standard library functions might serve as a dictionary. Fixed 
dictionaries may be useful in some situations, but there are at least two serious 
drawbacks. First, the dictionaries must be known to both the coder and decoder. 
Changes to the dictionary would have to be propagated to all the sites which 
use the scheme. Second, fixed dictionary schemes cannot compress “unknown” 
text. In the case of C code, there would likely be little compression of the vari- 
able names created by the programmer. 

Our main interest here is methods which adapt to the source; that is, meth- 
ods which build the dictionary from the source, and which usually do this on- 
the-fly as the source is scanned. Communication via modem commonly uses 
such a scheme (V.42bis). Fixed dictionaries would be of little use for general- 
purpose communications, and, in addition, on-the-fly dictionary creation is per- 
haps essential if the session is interactive. 

Adaptive¹ dictionary methods can often be traced to the 1977 and 1978
papers by Ziv and Lempel [85,86]. The general schemes are known as LZ77 
and LZ78, respectively. Applications employing variations on LZ77 include 
LHarc, PKZIP, GNU zip, Info-ZIP, and Portable Network Graphics (PNG), 
which is a lossless image compression format designed as a GIF successor.²
LZ78-type schemes are used in modem communications (V.42bis), the Unix 
compress program, and in the GIF graphics format. 

¹The use of “adaptive” in the literature has not always been consistent. See Langdon and
Rissanen [44] or Williams [82] for some discussion.


²See http://www.clione.co.jp/clione/lha, http://www.pkware.com, http://www.gnu.org,
http://www.info-zip.org, and http://www.libpng.org, respectively.
The basic difference between LZ77 and LZ78 is in the management of
the dictionary. In LZ77, the dictionary consists of fragments from a window
(the “sliding window”) into recently seen text. LZ78 maintains a dictionary of
phrases. In practice, LZ77 and LZ78 schemes may share characteristics. There 
are distinct advantages of each scheme: roughly speaking, LZ78 uses a more 
structured approach in managing a slow-growing dictionary (possibly trading 
compression for speed at the coding stage), and LZ77 has a rapidly changing 
dictionary (which may offer better matches) and is faster for decoding. In ap- 
plications, the choice of basic scheme may be complicated by various patent 
claims (see Appendix C). 

If dictionary methods are both simple and popular, the reader may be won- 
dering why they’ve been presented after the probabilistic methods. Part of the 
reason is historical, but it should also be noted that, subject to fairly modest re- 
strictions, the compression achieved by a dictionary method can be matched by 
a statistical method (see Section 9.3). However, dictionary methods continue to
be very popular due to their simplicity, speed, relatively good compression, and 
lower memory requirements compared to the best statistical methods. 

A combination of dictionary and probabilistic schemes is possible. An 
example is provided by the GNU zip program discussed in Section 9.1.2, which 
uses a statistical method on the output of the dictionary coder.


9.1 LZ77 (sliding window) schemes 


In the basic scheme, a two-part window is passed over the source: 


              history         lookahead


...She s|ells sea shells by t|he seash|ore...


         09876543210987654321


In the simplest case, the history and lookahead are of fixed length, and the 
dictionary consists of all phrases (that is, fragments of consecutive characters) 
which start in the history and which are no longer than the length of the
lookahead.³ With such a dictionary, it is convenient to think of the history as the
dictionary; however, we will exhibit schemes where these differ. Typically, the 
history is much longer than the lookahead. 

The idea is to replace an initial segment of the lookahead with a pointer 
into the dictionary, and then slide the window along. In the example, the ini- 
tial segment ‘he’ matches the second two characters of the dictionary phrase 
‘shell’ (at an offset of 10 into the dictionary, counting right to left). The orig- 
inal scheme would output a triple (offset, length, character), where the third 
component is the next character from the source (the “unmatched character”),
and then the window is moved: 


³It is understood that the source symbols are also included in the dictionary.
              history         lookahead


...She sell|s sea shells by the |seashore|...


            09876543210987654321


In this example, the triple is (10, 2, ␣), where ‘␣’ represents the space character.
Compression is achieved if this triple requires fewer bits than the three symbols 
replaced. 
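
The search that produces such a triple can be sketched in a few lines of Python. This is a brute-force illustration rather than a practical coder; the function names are hypothetical, and the rule for breaking ties between equally long matches (smallest offset wins) is an assumption, since the text does not specify one.

```python
def longest_match(history, lookahead):
    """Return (offset, length) of a longest match for the lookahead.

    Offsets count right to left from the end of the history (1 is the
    last character). A match must start in the history but may run on
    into the lookahead. Ties go to the smallest offset, an arbitrary
    choice made for this sketch.
    """
    window = history + lookahead
    start = len(history)
    best_offset, best_length = 0, 0
    for offset in range(1, start + 1):
        length = 0
        # Stop one short of the lookahead's end, so that a character
        # is always left over to serve as the unmatched one.
        while (length < len(lookahead) - 1
               and window[start - offset + length] == lookahead[length]):
            length += 1
        if length > best_length:
            best_offset, best_length = offset, length
    return best_offset, best_length

def triple(history, lookahead):
    """Form the (offset, length, character) triple for the current window."""
    offset, length = longest_match(history, lookahead)
    return offset, length, lookahead[length]

print(triple("ells sea shells by t", "he seash"))
```

For the window shown above, the call reports offset 10, length 2, and the space as the unmatched character.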

Sending the unmatched character in the triple allows the scheme to proceed 
even in the case of no match in the history. However, it is sometimes wasteful 
in the case that the character can be part of a match at the next stage. This 
occurs in the example, with ‘␣sea’ matching the dictionary, and it is common
for LZ77 schemes to look for this match. Conceptually, this means that the 
output consists of two kinds of data, rather than triples: (offset, length) pairs, 
and single (unmatched) characters. The following diagram shows a few steps in 
the process for the example problem: 


              history         lookahead     output


...She s|ells sea shells by t|he seash|ore...    (10, 2)
.She sel|ls sea shells by the| seashor|e.....    (18, 4)
 sells s|ea shells by the sea|shore...|...       (17, 2)
ells sea| shells by the seash|ore.....|...       o


         09876543210987654321


The decoder receives the output tokens, from which it can reconstruct the 
source (by maintaining the same dictionary as the coder). There are several 
observations which can be made concerning this type of scheme: 


• The decoder must be able to distinguish between ordered pairs and
characters. This implies that there will be some overhead in transmitting an
unmatched character, and hence the scheme can cause expansion (this 
should not come as a surprise). 


• The compression achieved by transmitting an ordered pair depends on the
match length and the sizes of the dictionary and lookahead. Too short a 
match length will cause expansion (in which case it may be desirable to 
transmit an unmatched character). 


• The match can extend into the lookahead. As an example, suppose the
window contains the fragment


history lookahead


     ab|aba


     21


A match for the lookahead is ‘aba’, beginning at offset 2 into the history 
and extending into the lookahead. 


• At each stage, greedy parsing was used: the dictionary was searched for
a longest match for the initial segment. There is no guarantee that greedy
parsing maximizes compression. It would be preferable if the scheme
could search for the “best” combination of ordered pairs and characters,
but this is intractable. A very limited form (lazy evaluation) of this is
discussed below.


• Searching for the longest match in the dictionary could be expensive. A
number of LZ77-variants maintain a dictionary (and structures to speed
searching) which includes only some of the phrases in the history, thus
limiting the amount of searching. This may result in less compression:
in the example, instead of the 4-character match at the second stage, we
could have matched ‘␣s’ with the first two characters of ‘␣shells’
from the dictionary.


• An attractive feature of many of the LZ77-variants (other than their
simplicity) is fast decoding: while the coder must do the hard work of finding
matches, the decoder need only do simple lookups to rebuild the source.


• The output of the coder could be subject to additional compression. As
an example, suppose a fixed number of bits are used to store the length 
component of the ordered pair. If short match lengths are more common, 
then a probabilistic scheme on the match lengths may be effective. 
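
Two of the observations above, that decoding needs only simple lookups and that a match may extend into the lookahead, can be seen together in a short Python sketch of a decoder. The token representation (a one-character string for a literal, a tuple for an (offset, length) pair) is an assumption made for illustration only.

```python
def decode(tokens):
    """Rebuild the source from literals and (offset, length) pairs.

    A literal is a one-character string; a pair is a tuple whose offset
    counts back from the current end of the decoded text.
    """
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                # Copy one character at a time, so that a match may
                # overlap the text it is producing (offset < length).
                out.append(out[-offset])
        else:
            out.append(token)
    return "".join(out)

print(decode(["a", "b", (2, 3)]))
```

Here the pair (2, 3) begins two characters back in the history ‘ab’ and copies three characters, the last of which was itself produced by the copy, giving ‘ababa’.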


Among schemes of this form, there is considerable flexibility in choosing 
the sizes of the history and lookahead, and in management of the dictionary. 
To understand the process and considerations more clearly, some notes on two 
specific implementations of LZ77-type schemes are presented. The first of these 
is a revised version of the LZRW1 scheme proposed by Ross Williams in [83], 
and illustrates design decisions favoring speed over compression. The second 
is the well-known GNU zip (gzip) utility, which is similar to LZRW1 but uses 
more advanced techniques in managing a larger dictionary and lookahead. 


9.1.1 An LZ77 implementation 


Suppose the symbol set consists of the 256 8-bit characters. The history and 
lookahead are both fixed-length, with offsets represented in 12 bits (giving a 
history of length 2^12 = 4096 bytes) and match lengths in 4 bits. A single
“control bit” is used to distinguish (offset, length)-pairs from single characters in the
output stream. 

The cost of transmitting an (offset, length)-pair and control bit is then 17 
bits, while 2 characters cost 18 bits. For this reason, we transmit a pair only if 
the match length is at least 2. Accordingly, the 4 bits used for the length will 
represent lengths from 2 to 17. Note that expansion occurs whenever a single 
character is output, or if a pair is output with match length 2. 

As an illustration, the phrase from the previous section is passed through 
the coder. The portions replaced by (offset, length)-pairs are underlined: 
She sells sea shells by the seashore
         └─┘ └┘└┘└──┘    └───┘ └┘

Pairs, from left to right: (6,3), (4,2), (14,2), (11,4), (24,5), (17,2).



The original string required 36 · 8 = 288 bits. In the encoded stream, a total of
18 characters are replaced by 6 pairs (leaving 18 unmatched characters to send),
giving an encoded stream of length 6 · 17 + 18 · 9 = 264 bits. It should be noted
that the example has been chosen so that matches against the dictionary occur 
almost immediately—typically, more of the source must make its way into the 
dictionary before many matches occur. 
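
The parse and the bit count for this example can be reproduced with a brute-force Python sketch. It is not taken from any of the implementations cited; the names are hypothetical, offsets are assumed to run from 1 to 4096, and ties between equally long matches are broken in favor of the nearest one, which is an assumption that happens to reproduce the pairs shown.

```python
OFFSET_BITS, LENGTH_BITS = 12, 4
MIN_LEN = 2
MAX_LEN = MIN_LEN + (1 << LENGTH_BITS) - 1    # 17
HISTORY = 1 << OFFSET_BITS                    # 4096
PAIR_COST = OFFSET_BITS + LENGTH_BITS + 1     # 17 bits, control bit included
LITERAL_COST = 8 + 1                          # 9 bits, control bit included

def encode(text):
    """Greedy parse into literals and (offset, length) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        best_off, best_len = 0, 0
        limit = min(MAX_LEN, len(text) - i)
        # Nearest candidates first, so ties go to the smallest offset.
        for j in range(i - 1, max(i - HISTORY, 0) - 1, -1):
            k = 0
            while k < limit and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        if best_len >= MIN_LEN:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def encoded_bits(tokens):
    """Total cost of the token stream in bits."""
    return sum(PAIR_COST if isinstance(t, tuple) else LITERAL_COST
               for t in tokens)

tokens = encode("She sells sea shells by the seashore")
print(tokens)
print(encoded_bits(tokens))
```

The exhaustive inner loop is the expensive part; it is exactly the work that the tree and hash structures discussed in this section are meant to reduce.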

The parsing has been greedy: at each stage, the dictionary is searched for a 
longest match to an initial segment of the lookahead. Here, we’ve assumed that 
the dictionary includes every phrase of length 2 to 17 which begins in the history.

The LZ77 variation described is known as LZSS,⁴ and an implementation
with well-documented source in the C programming language appears in Nel- 
son and Gailly [53]. A tree structure is placed over the history, reducing search 
times but adding to the complexity and storage requirements. Even so, an ex- 
haustive search for the longest match among every phrase of the history may 
be prohibitively expensive. Both LZRW1 and GNU zip limit the search for 
repeated strings in order to improve speed at the coding stage. 


LZRW1 


The LZRW1 algorithm was presented by Ross Williams in [83]. Its main design 
goals favored speed over compression, and the result was a very compact and 
fast method. The sources from a revised version appear in Appendix B. 

As above, the history and lookahead are fixed-length, with sizes represented 
in 12 and 4 bits, respectively. However, the dictionary includes only a subset 
of the phrases which start in the history, and a hash function is used to speed 
searches. More specifically, the hash is a function of the first three characters 
of the lookahead, pointing to the last occurrence of a 3-character string with the 
same hash. If a match occurs, then this offset is used in the output. This can 
provide very fast matching, but it significantly reduces the size of the diction- 
ary. 

Use of the hash function means that only match lengths of at least 3 will 
be considered. Accordingly, the 4 bits used for the length will represent lengths 
from 3 to 18.⁶ This allows a longer match-length than that used above (18 bytes
instead of 17); however, the overhead in representing characters and pairs hasn’t 
changed, so the inability to code a 2-character sequence as a pair may degrade 
compression (which occurs in the “She sells...” example above). 
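
The effect of keeping only one candidate per 3-character key can be sketched in Python. This is a simplified illustration and not the LZRW1 source of Appendix B: a dictionary keyed on the three characters themselves stands in for the hash table (so there are no collisions), the function name is hypothetical, and it is assumed that only positions where a parsing step begins are entered in the table.

```python
def encode_last_occurrence(text, min_len=3, max_len=18, history=4095):
    """Parse using only the last recorded position of each 3-character key."""
    table = {}      # key -> position where that key last began a parsing step
    tokens = []
    i = 0
    while i < len(text):
        key = text[i:i + min_len]
        j = None
        if len(key) == min_len:
            j = table.get(key)
            table[key] = i                # remember this position for later
        if j is not None and i - j <= history:
            k = min_len                   # the key itself already matches
            while (k < max_len and i + k < len(text)
                   and text[j + k] == text[i + k]):
                k += 1
            tokens.append((i - j, k))
            i += k
        else:
            tokens.append(text[i])
            i += 1
    return tokens

tokens = encode_last_occurrence("She sells sea shells by the seashore")
print([t for t in tokens if isinstance(t, tuple)])
```

On the sentence used earlier, this sketch finds three pairs and passes over the two-character matches, since a lookup is only attempted on three characters at a time.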


⁴See Storer and Szymanski [75] and Bell [7].

⁵It is essential to identify the dictionary in LZ schemes. Appendix B describes the use of the
hash in LZRW1 more completely, and from this the dictionary can be determined.


⁶In LZRW1, match lengths were limited to 16. This was mostly an oversight, and was corrected
in the LZRW1-A algorithm.
Table 9.1: Compression (% remaining) for selected Calgary corpus files.


                           (12,4)  (13,4)  (13,5)  (15,4)
File     Kbyte |  LZRW1     LZSS    LZSS    LZSS    LZSS    gzip  compress
bib        109 |
book1      751 |
geo        100 |
obj1        21 |
pic        501 |
progc       39 |
               |   60.0     52.1    49.2    49.6    48.2    38.6    47.8


LZRW1 obtains moderate compression using few resources. Table 9.1 
gives compression results on a subset of the “Calgary corpus” [8]. The columns 
for LZSS are tagged with pairs indicating the number of bits used to represent 
history and lookahead sizes. For reference, two well-known dictionary-type 
implementations are included: gzip (an LZ77-type scheme) and compress (an 
LZ78 scheme). The speed and simplicity of LZRW1 comes at a price: the 
compression with LZSS is generally superior (even with the same history and 
lookahead sizes). The difference is perhaps not as large as we might have ex- 
pected, given the very minimal dictionary searching used by LZRW1. 

In applications, it may be acceptable to sacrifice some speed during coding 
in order to improve compression. For example, on-line documentation may 
be viewed frequently—the speed of compression may be less important than 
the amount of compression and the speed of decompression. LZ77 schemes such
as LZRW1, LZSS, and that used in gzip all have very fast decoding due to the 
way the history is used as the dictionary. If this feature is to be retained, then 
the following are perhaps the simplest modifications to improve compression. 


Enlarge the history (and dictionary) and/or lookahead. However, more 
bits will be needed for (offset, length)-pairs, and this can increase the breakeven 
point. In our example, 12 bits were used for the offset and 4 bits for the length 
(and 1 bit for control). The breakeven point is just over 2 characters. If, say, a
pair requires 15 bits and 8 bits, respectively, then the breakeven is 3 characters; 
i.e., no compression will be achieved unless the match length is at least 4. 

Recall also that increasing either of these can greatly increase search times. 
The use of a hash function in LZRW1 is fast, but LZSS finds longer matches. 
A compromise might involve the use of a hash chain to follow (some of) the 
matches in the history. 


Improve the parsing. We’ve been greedy: there has been no lookahead for 
the best combination of literals and matches. For our example, greedy parsing 
was used at the stage 


She sells sea s|hells by the seashore


and the next output items were the pairs corresponding to ‘he’ and ‘lls␣’,
respectively, for a total of 17 + 17 = 34 bits. For this fragment, it would be
preferable to use lazy evaluation: send ‘h’ as a literal and then send the pair for
‘ells␣’ using a total of 9 + 17 = 26 bits.
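
The decision made by lazy evaluation at this point can be checked with a small Python sketch. The helper name and the brute-force search are assumptions made for illustration; the costs of 17 and 9 bits are those of the example scheme.

```python
PAIR_BITS, LITERAL_BITS = 17, 9

def match_length(text, i, min_len=2, max_len=17):
    """Length of a longest match for text[i:] that starts before position i.

    Returns 0 when no match of at least min_len characters exists.
    """
    best = 0
    limit = min(max_len, len(text) - i)
    for j in range(i):
        k = 0
        while k < limit and text[j + k] == text[i + k]:
            k += 1
        best = max(best, k)
    return best if best >= min_len else 0

text = "She sells sea shells by the seashore"
i = text.index("hells")                  # position of the 'h' in 'shells'
here = match_length(text, i)             # match for 'he'
ahead = match_length(text, i + 1)        # match for 'ells '
after = match_length(text, i + here)     # match for 'lls '

# Both choices cover the same six characters 'hells '.
greedy_bits = PAIR_BITS + PAIR_BITS      # pair for 'he', pair for 'lls '
lazy_bits = LITERAL_BITS + PAIR_BITS     # literal 'h', pair for 'ells '
print(here, ahead, after, greedy_bits, lazy_bits)
```

Because the match found one symbol later (length 5) is longer than the match at the current symbol (length 2), the lazy rule defers and sends the literal.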


Compress the output of the coder. The output of the coder has been de- 
scribed as a mix of literals and (offset, length)-pairs. If, say, short match lengths 
are more common, then a statistical coder may be able to compress the lengths. 
Of course, this modification may somewhat increase the work for the decoder. 


The “gzip” column in Table 9.1 is of special interest. It uses an LZ77-type 
scheme of the same basic form as LZRW1 and LZSS, but typically offers supe- 
rior compression. To some extent, it implements all three of the modifications 
mentioned above. 


9.1.2 Case study: GNU zip 


The gzip program is widely used as a general purpose compressor for files (or 
streams of data), and was designed as a replacement for the compress utility 
(which uses a patented LZ78-type scheme). Sources for gzip can be found in 
the references listed in Appendix C. 

As GNU Project software, it was essential to have a patent-free scheme 
with freely distributable sources. Design goals included portability and accept- 
able “worst case” performance. The history of the development suggests that 
compression performance was of less importance (but perhaps decompression 
speed was essential). Ross Williams’ LZRW1 met these conditions, and was to 
have been used as the basic scheme in gzip.⁷ To the dismay of Williams, it was
discovered that (use of) the algorithm was covered by patent (see Appendix C). 

The “deflation” compression method used in the current gzip shares many 
of the features of LZRW1. It cannot match the speed of coding, but it generally 
gives more compression and faster decompression than that obtained with the 
compress utility (a common reference for dictionary schemes). As noted above, 
LZRW1 gives somewhat less compression than compress.

The algorithm in gzip is LZ77-type, with 15 bits reserved for the offset 
(giving a 32K history) and 8 bits for the length. An (offset, length)-pair then 
requires 23 + 1 bits, so the breakeven is 3 characters. Since literals cost 9 bits,
only matches of length 3 or more are worth considering. Hence, the 8 bits
represent match lengths of 3 to (2^8 − 1) + 3 = 258 characters.

A hash of the first three characters from the lookahead is used to speed 
searches into the history. Unlike LZRW1, a hash chain is maintained in order to 
permit searching for longer matches. Searching through the chain is expensive, 
but it is performed only during coding. 
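
The idea of a hash chain with a bounded search depth can be sketched as follows in Python. This is not the gzip source: the 3-character string itself is used as the key in place of a hash value, the function and parameter names other than max_chain are hypothetical, and refinements such as nice_length are left out.

```python
def insert(data, i, head, prev):
    """Record position i under its 3-character key (a stand-in for a hash)."""
    key = data[i:i + 3]
    prev[i] = head.get(key, -1)   # link back to the previous position with this key
    head[key] = i

def search(data, i, head, prev, max_chain=128, max_len=258, window=32768):
    """Follow the chain for data[i:i+3], most recent position first.

    At most max_chain candidates are examined; returns (offset, length)
    of the longest match found, or (0, 0) if there is none.
    """
    best_off, best_len = 0, 0
    j = head.get(data[i:i + 3], -1)
    steps = 0
    while j >= 0 and i - j <= window and steps < max_chain:
        k = 0
        while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
            k += 1
        if k > best_len:
            best_off, best_len = i - j, k
        j = prev[j]
        steps += 1
    return best_off, best_len

data = "abcYabcXabcY"
head, prev = {}, {}
for p in range(8):
    insert(data, p, head, prev)
print(search(data, 8, head, prev, max_chain=1))
print(search(data, 8, head, prev, max_chain=2))
```

With a depth of 1 only the most recent ‘abc’ is tried, giving a match of length 3; allowing a second step down the chain reaches the older occurrence and a match of length 4. This is the kind of trade between search time and match quality that the compression level controls.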


⁷Confirmed via private email with Jean-loup Gailly, the principal gzip developer. Quoted by
permission.
Lazy evaluation is used: after finding the longest match at the current
symbol, gzip looks for a longer match at the next symbol. If a longer match is
found, the current symbol is sent as a literal, and the process continues at the
next symbol.⁸

To be precise, the lazy evaluation and the search through the hash chain are 
subject to runtime choices (the ‘-0’ to ‘-9’ compression level options). Several 
parameters are set, including: 


good_length: If the current match has length at least good_length, then the
maximum depth for a lazy search is reduced (divide max_chain by 2).
max_lazy: Do not perform lazy search above this match length.
nice_length: Quit search above this match length.
max_chain: Limit on depth of hash chain search.

Example choices for the parameters appear in Table 9.2. With the exception
of the step between compression levels 3 and 4, the parameter values increase
with the compression level. It seems reasonable to expect that larger values
should correspond to improved compression, but there is no guarantee of this.


Table 9.2: Parameter values corresponding to gzip compression level.

                       Compression level
Parameter        1     3     4     6ᵃ     8      9
good_length                  4     8     32     32
max_lazyᵇ                    4    16    128    258
nice_length      8    32    16   128    258    258
max_chain        4    32    16   128   1024   4096

ᵃDefault value.
ᵇNo lazy search on levels 0-3. A fifth parameter (max_insert_length) limits the
updating of the hash table (for speed).


The output of the “dictionary-scheme” portion of gzip may be compress- 
ible. This is perhaps easiest to see in the match lengths—short matches may 
be much more common than longer matches. A second “back-end” compressor 
is used which compresses literals or match lengths with one Huffman tree, and 
match offsets with another. 

Performance results for various choices of the compression level can be 
seen in Table 9.3 on page 243. These tests were run on a SPARCstation 20, 
with timings obtained by averaging several blocks of 20 runs. Included are 
results from the Unix compress program, which uses an LZ78 scheme and has 
become a standard reference for dictionary methods. 
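
Readers who want to see the effect of the compression level without building gzip can experiment with Python's standard zlib module, which provides the same deflate method with levels 1 to 9. The sketch below is only an illustration on an artificial, highly repetitive input; it does not reproduce the measurements of Table 9.3, and the sizes it prints are those of a zlib stream rather than of a gzip file.

```python
import zlib

# An artificial test input: highly repetitive, so it compresses well.
data = b"She sells sea shells by the seashore. " * 200

sizes = {}
for level in (1, 6, 9):
    packed = zlib.compress(data, level)
    assert zlib.decompress(packed) == data   # lossless round trip
    sizes[level] = len(packed)

print(len(data), sizes)
```

On realistic files the differences between levels are more informative than on this input, and timing the calls gives a rough picture of the speed side of the trade.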


⁸The notes in ‘algorithm.doc’ in the gzip-1.2.4 distribution may be misleading on this point;
however, the actual source code explains it well. 
Exercises 9.1
1. Williams’ paper [83] contains the following bit of poetry: 


A walrus in Spain is a walrus in vain 


(a) Find the (offset, length)-pairs produced by the LZRW1 scheme. Indi- 
cate the text fragment corresponding to each pair. 


(b) Calculate the compression (or expansion) in this example. 


2. Lazy evaluation can do worse than greedy parsing. Consider an LZSS-type 
scheme where offsets are represented in 12 bits and lengths are represented 
in 4 bits with match lengths from 2 to 17. The use of a control bit means that
an (offset, length)-pair requires 17 bits and a literal requires 9 bits. Suppose 
the current window contains 


    history lookahead


abcbcdedefg|abcdefg


Show that greedy parsing leads to 34 bits output, while lazy evaluation 
results in 43 bits. 


3. gzip searches the hash chains so that the most recent strings are found first, 
and matches of length 3 which are too distant (more than 4K bytes) are 
ignored. How does this help compression? (Hint: Consider the back-end 
Huffman processing.) 


4. (Programming exercise) Choose a set of test files and determine if the lazy 
evaluation of gzip is effective (consider both time and compression). It will 
be necessary to modify the sources of gzip so that the parameter values 
(other than lazy evaluation) remain the same for tests with and without lazy 
evaluation. 


5. The hash function used in LZRW1 can be found in Appendix B. Knuth [39] 
writes that such functions should be quick to compute and should minimize 
collisions. Does the choice in LZRW1 satisfy these criteria? The constant 
40543 which appears in the definition is prime, but is there any other reason 
for its choice? (You may wish to consult Knuth’s book.) 




9.2 The LZ78 approach 


The LZ77-schemes discussed in the previous section are attractive for their
simplicity, relatively good speed and compression for the resources required, and
fast decoding. An exhaustive search through the history during coding can be
expensive, but implementations such as LZRW1 and gzip illustrate methods to
trade compression for speed. Another possible concern with the scheme as
presented is that only “recently-seen” strings can be matched. The history (and/or
lookahead) could be enlarged (and offsets could be represented with variable-
length pointers), but this can add complexity and search time to the scheme.


Input:         |abababa     a|bababa     ab|ababa     abab|aba
Trie update:   #0 + a       #0 + b       #1 + b       #3 + a
Output:        (#0, a)      (#0, b)      (#1, b)      (#3, a)


Figure 9.1: LZ78 coding on ‘abababa’.

LZ78 takes a different approach in building a slow-growing dictionary. The 
source is parsed into phrases, and a new phrase enters the dictionary by adding a 
single character to an existing phrase. In practice, there is a ceiling on the num- 
ber of phrases, and some action is performed when the dictionary fills. There 
is a price for this “more structured” approach: the decoder must maintain the 
dictionary.⁹

In its basic form, LZ78 starts with an empty dictionary (denoted by ‘#0’ in 
Figure 9.1). At each stage the longest match for the lookahead is sought from 
the dictionary, and a pointer #n to this phrase, along with the “unmatched” 
character c, is output as the ordered pair (#n,c). The dictionary is updated, 
adding c to the phrase represented by #n. The decoder must maintain the same 
dictionary of phrases. 

To see how this works, consider encoding ‘abababa’. The top row of Fig- 
ure 9.1 shows the source, with a vertical line separating the history from the 
lookahead. The second row illustrates the updating of the dictionary trie.¹⁰
Phrases corresponding to a pointer #n are found by walking up the tree: for 
example, phrase #3 is ‘ab’. At each stage, the dictionary is traversed for the 
longest match against the lookahead, and then the phrase number and unmatched 
character are output (the last row in the figure). The notation ‘#n +c’ in the 
dictionary update means that the character c is to be added to the phrase
represented by #n. 
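
The parsing just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `lz78_encode` is ours, the trie is stored as a map from (parent phrase number, character) to child phrase number, and a source that ends in the middle of a match is handled by emitting the pointer with an empty character (a convention chosen here, not one fixed by the text).

```python
def lz78_encode(source):
    """Parse `source` into (pointer, character) pairs in the LZ78 manner."""
    # trie[(n, c)] is the number of the phrase obtained by adding c to phrase #n.
    # Phrase #0 is the empty phrase at the root.
    trie = {}
    next_number = 1
    pairs = []
    node = 0                                  # phrase number of the current match
    for ch in source:
        if (node, ch) in trie:
            node = trie[(node, ch)]           # the match can be extended
        else:
            pairs.append((node, ch))          # output the pair (#n, c)
            trie[(node, ch)] = next_number    # dictionary update #n + c
            next_number += 1
            node = 0                          # next match starts at the root
    if node != 0:
        # Assumed convention: the source ended inside a match.
        pairs.append((node, ""))
    return pairs


print(lz78_encode("abababa"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
```

The four pairs agree with the last row of Figure 9.1.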

Figure 9.1 shows that four steps are needed to encode the sample source, 


⁹ The LZ77 decoder must maintain the history; however, for the schemes described, this is
considerably less than maintaining the LZ78 dictionary.

¹⁰ The name “trie” was suggested by E. Fredkin, according to Knuth [39]. In a footnote to
Fredkin’s paper (Trie memory, Communications of the ACM, 3(9):490-499, September 1960), an editor
remarks that “trie is apparently derived from retrieval.” Knuth describes a trie as essentially an
n-ary tree, whose nodes are n-vectors. Each node represents the set of keys that begin with a certain
sequence of characters; the node specifies an n-way branch, depending on the next character.
Input:    (#0, a)     (#0, b)     (#1, b)     (#3, a)
Trie:     #0 + a      #0 + b      #1 + b      #3 + a
Output:   a           b           ab          aba


Figure 9.2: LZ78 decoding on the output of Figure 9.1. 


and the output consists of the ordered pairs (#0, a), (#0, b), (#1, b), and (#3, a). 
The dictionary at the next stage would consist of the following phrases: 


Dictionary trie              Corresponding phrases

        #0                   #0  null
      a/  \b                 #1  a
     #1    #2                #2  b
     b|                      #3  ab
     #3                      #4  aba
     a|
     #4


Note that phrases always start at the root. For example, the string ‘ba’ appears as 
part of ‘aba’ in the trie, but it is not a phrase in the dictionary. Unlike LZ77, the 
length is not passed as part of the pointer—the length of a phrase is understood 
from the trie structure. Also, the possibility of |S| children at a given node can
make for a more complicated trie structure in the case of a larger symbol set S. 

Decoding the output of Figure 9.1 is very simple, and consists of revers- 
ing the vertical arrows. Since we wish to compare this carefully with a modi- 
fied scheme, the decoding is shown in detail in Figure 9.2. The last row gives 
‘abababa’ as the recovered string, as expected. 
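
As a rough sketch of the decoding step, the following Python fragment rebuilds the dictionary from the pairs. It assumes, for brevity, that phrases are stored as whole strings in a list indexed by phrase number rather than as a trie; the name `lz78_decode` is ours.

```python
def lz78_decode(pairs):
    """Recover the source from LZ78 pairs; also return the phrase list."""
    phrases = [""]                     # phrase #0 is the null phrase
    pieces = []
    for n, c in pairs:
        phrase = phrases[n] + c        # dictionary update #n + c
        phrases.append(phrase)         # the new phrase gets the next number
        pieces.append(phrase)
    return "".join(pieces), phrases


text, phrases = lz78_decode([(0, "a"), (0, "b"), (1, "b"), (3, "a")])
print(text)       # abababa
print(phrases)    # ['', 'a', 'b', 'ab', 'aba']
```

The decoder performs exactly the same dictionary updates as the coder, which is why no dictionary needs to be transmitted.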

In implementations, the dictionary will eventually fill. Indeed, LZ78 vari- 
ants such as Unix compress and the V.42bis method commonly used in com- 
munication via modem have relatively low ceilings (64K and 2K, respectively) 
on the total number of phrases. There have been many schemes proposed for 
handling overflow: 




• Freeze the dictionary at this stage. Compression can suffer if the nature
of the source changes after the dictionary is frozen.

• Flush the dictionary and start over. This can discard much of what was
learned about the source.

• Monitor compression, and flush the dictionary only when performance
falls below a threshold. This is the approach in Unix compress.

• Prune the trie. This could perhaps be a “remove least recently used
phrase” scheme. V.42bis and the ‘shrink’ method of PKZIP 1.0 use pruning.

Many variants of the basic LZ78 scheme have been described. One of these, 
known as LZW [80], drops the explicit transmission of the unmatched character. 
This variant is the basis for the method used in Unix compress, V.42bis modem 
compression, and the GIF graphics format. Unisys currently holds a patent 
on some portions of the algorithm, and sells the license for use in modems 
supporting V.42bis. In 1995, Unisys announced that it would start pressing its 
claims in connection with the GIF graphics format. As Greg Roelofs wrote, 
“GIF became decidedly less popular right around New Year’s Day 1995 when 
Unisys and CompuServe suddenly announced that programs implementing GIF 
would require royalties, due to Unisys’ patent on the LZW compression method 
used in the GIF format.”¹¹ Since PNG offers technical advantages over GIF, it
is likely to receive considerable attention. 


9.2.1 The LZW variant 


In the basic LZ78 scheme described above, the output of the coder consists of 
a sequence of ordered pairs (#n,c), where #n is a pointer into the dictionary 
and c is the unmatched character from the search. LZ78 is said to have pointer 
guaranteed progress through the source, since c (known as the innovation or 
instance) is part of the output token. Explicit transmission of c may be wasteful 
in the case that c could be part of a match at the next stage. This same consid- 
eration motivated the development of the deferred innovation variation (LZSS) 
of LZ77. 

The idea in LZW is to completely drop the transmission of characters c. 
The dictionary updating process remains essentially unchanged, but the pro- 
gression through the source will differ from LZ78 (resulting in a different trie). 
LZW is said to have dictionary guaranteed progress through the source. In or- 
der for this to work, the dictionary is preloaded with the 1-character phrases 
from the symbol set. 

As an example, the LZW scheme is applied to the string ‘abababa’. The 
dictionary starts with entries for the 1-character phrases ‘a’ and ‘b’. At each 
stage, the longest match for the lookahead is found, and then the unmatched 
character c is added to #n, giving a new phrase #n + c. The procedure is illus- 
trated in Figure 9.3. 

The process looks quite similar to that in the LZ78 scheme of Figure 9.1. 
However, we obtain a different collection of phrases, and apparently some com- 
pression since only four pointers are output (rather than the four pairs of Figure 
9.1). The differences are more dramatic in the decoding process, and this ex- 
ample shows an exceptional case in LZW which must be handled. 
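
A minimal sketch of the LZW coder follows, assuming that phrases are kept in a Python dictionary keyed by their strings, that the symbol set is supplied in the order used to preload phrases #1, #2, ..., and that every source character belongs to the symbol set. The name `lzw_encode` is ours.

```python
def lzw_encode(source, symbols):
    """Return the list of phrase numbers output by an LZW coder."""
    # Preload the 1-character phrases: #1, #2, ... in the order given.
    table = {s: i + 1 for i, s in enumerate(symbols)}
    next_number = len(symbols) + 1
    output = []
    match = ""
    for ch in source:
        if match + ch in table:
            match += ch                       # keep extending the match
        else:
            output.append(table[match])       # output the pointer #n only
            table[match + ch] = next_number   # dictionary update #n + c
            next_number += 1
            match = ch                        # c starts the next match
    if match:
        output.append(table[match])           # flush the final match
    return output


print(lzw_encode("abababa", "ab"))
# [1, 2, 3, 5]
```

Note that the unmatched character is never transmitted; it simply becomes the first character of the next match.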


¹¹ Quoted from the Portable Network Graphics (PNG) page at http://www.wco.com/png/. Used
by permission. A short history of PNG appears in [60].
Input:    |abababa    a|bababa    ab|ababa    abab|aba
Trie:     #1 + b      #2 + a      #3 + a
Output:   #1          #2          #3          #5

Figure 9.3: LZW coding on ‘abababa’.
Input:    #1          #2          #3          #5
Trie:                 #1 + #2₀    #2 + #3₀
Output:   a           b           ab          aba


Figure 9.4: LZW decoding on the output of Figure 9.3. 


In the LZW decoding of Figure 9.4, the updating of the dictionary is indicated
with expressions of the form #n + #m₀, where the new subscript indicates
that the first character corresponding to phrase #m is to be added to phrase #n.
For example, the dictionary update #1 + #2₀ adds ‘b’ (the first, and only,
character of phrase #2) to ‘a’ (the phrase corresponding to #1), giving a new phrase
‘ab’ (phrase #3).

The last column of Figure 9.4 requires some explanation, since phrase #5 is
not even listed in the trie (so how did we know the output is ‘aba’?). This is the
exceptional case in LZW, and can be resolved by noting that the update of the
dictionary occurs one step later than in LZ78. The next update of the dictionary
would be #3 + #5₀, so the new phrase #5 satisfies #5 = #3 + #5₀. This implies
that phrase #5 begins with the characters of phrase #3; in particular, #5₀ = #3₀.
Hence, #5 = #3 + #5₀ = #3 + #3₀ = ‘aba’.

Looking back at the coding stage, it can be seen that the exceptional case 
occurs when the newest node at a given stage is used as the output. This occurs 
in the last column of Figure 9.3. This case could be avoided (possibly resulting 
in less compression) if the newest node at a given stage is not considered for
output. In the example, the coder could have split the match into #3 and #1. 
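
The decoder, including the exceptional case, might be sketched as follows. The code assumes the same preloaded numbering as the coder and stores phrases as strings; the name `lzw_decode` and the error handling are our own choices.

```python
def lzw_decode(codes, symbols):
    """Return the list of phrases recovered from LZW phrase numbers."""
    table = {i + 1: s for i, s in enumerate(symbols)}   # preloaded phrases
    next_number = len(symbols) + 1
    output = []
    prev = None                        # phrase decoded at the previous stage
    for code in codes:
        if code in table:
            phrase = table[code]
        elif code == next_number and prev is not None:
            # Exceptional case: the pointer names the phrase that is about to
            # be created, so it must be prev followed by prev's first character.
            phrase = prev + prev[0]
        else:
            raise ValueError("invalid pointer #%d" % code)
        if prev is not None:
            # The update lags one step behind the coder: add the first
            # character of the current phrase to the previous phrase.
            table[next_number] = prev + phrase[0]
            next_number += 1
        output.append(phrase)
        prev = phrase
    return output


print(lzw_decode([1, 2, 3, 5], "ab"))
# ['a', 'b', 'ab', 'aba']
```

On the last pointer the table holds only phrases #1 through #4, so the exceptional branch is taken and produces ‘aba’, in agreement with Figure 9.4.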


9.2.2 Case study: Unix compress 


The compress program uses an LZW variant called LZC. The utility compresses 
single files (or streams of data). Versions exist across many platforms, and it 
has become a standard reference when comparing compression schemes. It has 
relatively good performance, given the resource requirements. 

The scheme uses pointers of variable size to tag dictionary entries, starting
with 9 bits and increasing up to a ceiling of 16 bits (corresponding to a
dictionary with 2^16 = 64K entries). An option allows setting a lower ceiling, typically
so that files can be uncompressed on small machines.

The dictionary will eventually fill. In this case, compress monitors com- 
pression, flushing the dictionary when performance drops. This is a simple 
scheme to implement, and gives compress the ability to adjust to changes in the 
nature of the data after the dictionary fills. Automatically clearing the diction- 
ary on a periodic basis has a similar goal, but may be wasteful if the dictionary 
is performing well. 

Compress can perform rather badly on random data, since the output at 
each stage consists of a pointer of 9-16 bits (corresponding to an expansion of 
9/8 to 2 for a single character match). The parsing of LZ78 with a preloaded 
dictionary may do better in this special case, since tokens would represent at 
least 2 characters of the source (corresponding to an expansion of 17/16 to 3/2 
for a single character match if compress were minimally modified to use an 
LZ78 approach). 
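
The expansion figures quoted above follow from simple arithmetic, which the short sketch below reproduces. It assumes 8-bit source characters, and that an LZ78-style token is a pointer plus one explicit character covering two characters of the source; the name `expansion` is ours.

```python
from fractions import Fraction

CHAR_BITS = 8   # assumption: 8-bit source characters


def expansion(pointer_bits):
    """Return (LZW ratio, LZ78 ratio) for a single-character match."""
    # LZW (compress): one pointer is output for one source character.
    lzw = Fraction(pointer_bits, CHAR_BITS)
    # LZ78-style token: pointer plus explicit character, covering two characters.
    lz78 = Fraction(pointer_bits + CHAR_BITS, 2 * CHAR_BITS)
    return lzw, lz78


for bits in (9, 16):
    lzw, lz78 = expansion(bits)
    print(bits, lzw, lz78)
# 9 9/8 17/16
# 16 2 3/2
```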

The results are mixed: compress gives better compression than LZRW1, but
perhaps not as much as expected, given the very minimal resource requirements
of LZRW1. Table 9.1 shows a sample from the Calgary corpus. LZRW1
“compresses about 10% absolute worse than [compress], but runs four times faster”
in the tests run by Williams [83]. Note that LZRW1 actually beat compress on
‘obj1’ (VAX object code).

The GNU zip (gzip) utility discussed in Section 9.1.2 was designed as a 
replacement for compress, and Table 9.3 gives performance results on several 
of the Calgary files. The tests were performed on a SPARCstation 20, with tim- 
ings obtained by averaging the results from the Unix time command on blocks 
of 20 runs. The decode rates are based on the size of the original file. As ex- 
pected, compress is faster than gzip for compression, but considerably slower 
on decoding. 

Although gzip can only look at the most recent 32K of history, the diction- 
ary contains many more entries than the 64K phrases maintained by compress. 
For the test files in Table 9.3, gzip gives superior compression, even at the low- 
est (fastest) setting. The increased compression at the higher levels comes at a 
Table 9.3: Performance of gzip vs compress.

                  Encode (10K/s) [compression (% remaining)]        Decode (10K/s)
File    Size (K)  compress    gzip (-1)   gzip (-3)   gzip (-6)     compress  gzip
bib       109     58 [41.8]   51 [39.4]   39 [35.7]   18 [31.5]
book1     751     42 [43.2]   42 [47.5]   28 [43.8]   12 [40.8]
geo       100     45 [76.0]   27 [68.2]   14 [67.9]    6 [66.9]
obj1       21     44 [65.3]   37 [49.8]   34 [49.2]   23 [48.0]
pic       501    125 [12.1]  101 [12.8]   84 [12.2]   36 [11.0]
progc      39     52 [48.3]   46 [39.0]   38 [36.6]   22 [33.5]
average           70 [47.8]   51 [42.8]   39 [40.9]   20 [38.6]     98        152


rather steep price, due to the more exhaustive searches (and lazy matching at 
levels above 3). 


Exercises 9.2 


1. LZ78 coding produced the pairs: (#0, M), (#0, i), (#0, s), (#3, i), (#3, s),
(#2, p), (#0, p), (#0, i). Decode this to obtain ‘Mississippi’. Show the final
dictionary obtained.


2. Encode ‘Mississippi’ using an LZW scheme with symbol set {M, i, p, s}
and initial dictionary


#1 M
#2 i
#3 p
#4 s


Show the final trie obtained. 
3. The symbol set {a,b} was used for LZW encoding, with initial dictionary 


#1 a 
#2 b 


Decode the sequence #1, #2, #4. Explain your steps. 


4. Some dictionary coders can be converted into a statistical model which 
gives the same compression. The code space used by a phrase in the dic- 
tionary scheme is decomposed into the space used by the individual char- 
acters. As an example, consider a greedy parsing scheme with S = {a,b} 
and dictionary {a,ba,bb}. Suppose the dictionary scheme assigns 1/4 of 
the code space to ‘a’, 1/4 to ‘ba’, and the remaining 1/2 to ‘bb’. The 
[Figure: decomposition tree with root branches P(a|Λ) = 1/4 and P(b|Λ) = 3/4.]


Figure 9.5: The symbol-wise decomposition in Exercise 4. 


decomposition into a symbol-wise equivalent is shown in Figure 9.5.¹²


The statistical model is determined by the probabilities listed in the figure. 
Calculate the ideal number of bits assigned to each of the nodes by the 
symbol-wise equivalent (i.e., determine the number of bits used to encode 
an ‘a’ following ‘b’, etc.). Verify that that both the dictionary scheme and 
the statistical model give the same code length for ‘bb’. 




9.3 Notes 


Many LZ variants are discussed in Text Compression [8], and summarized con- 
cisely in [82]. Some schemes possess characteristics from both LZ77 and LZ78: 
the LZFG algorithm [19] combines the history structure of LZ77 with the phrase 
structure of LZ78. 

In practice, the slow growth of phrases in LZ78 may degrade compression. 
Horspool [31] proposes modifications to LZW involving more rapid phrase 
growth and phased in binary numbers, as a way to improve the compression 
without significantly degrading the speed. Non-greedy parsing schemes (such 
as the lazy evaluation used in gzip) for dictionary schemes are considered in 
Horspool [32]. These trade time for limited compression improvements, with 
only minimal (if any) changes needed on the decoding side. The cost of deferred 
innovation is examined in Cohn [11]. 

The relationship between statistical and dictionary schemes is sometimes 
direct: some greedy dictionary methods can be decomposed into statistical 
methods which give the same compression (see Exercise 9.2.4). Gutmann and 
Bell [26] present an approach which is the opposite of this decomposition in an 
attempt to obtain the better compression of statistical methods with the speed of 
dictionary schemes. 


¹² This example is adapted directly from Bell and Witten [9]. Their construction of a symbol-wise
equivalent for any nonadaptive greedy parsing method is more involved than this simple example 
might suggest, and the interested reader should consult their paper and also [8,42] for a much more 
complete discussion. 
Chapter 10


Transform Methods and Image 
Compression 


Images are natural and efficient conveyors of information and have been used 
throughout history as models for both reality and abstract concepts. Our ap- 
petite for visual information seems insatiable, and efficient image management 
continues to be of pressing concern. An image can contain a large amount of in- 
formation (more than a thousand words?) and often translates to a data structure 
whose large size can pose problems to storage and transmission management. 
The situation only gets worse when several images or animation is involved. 

Up to this point we’ve confined ourselves to compression methods involv- 
ing zero information loss, i.e., to “lossless” schemes, and any of the schemes 
from Chapters 5 through 9 could be used on images. However, “images” have 
certain common features that are better exploited by methods designed specif- 
ically for them. Some well-known lossless compression schemes for images 
include GIF and PNG, and each does a respectable compression job. But unless
something surprising comes along from the realm of “lossless technology,” we 
are not likely to see anything more than incremental improvements over these 
two methods. Large gains in compression ratios will come by dropping the
requirement that all information be retained in the compression process.¹ This
quickly brings up the question as to whether or not we can actually remove in- 
formation from data in a way that allows it to be significantly compressed and 
yet doesn’t thoroughly corrupt it. Fortunately, a moment’s reflection is all it 
takes to convince us that for many applications images have room to give. As 
an example, consider a black and white photograph in a newspaper. The pho- 
tograph itself is a (compressed) model of the image it was meant to capture. 
Close inspection (a magnifying glass will do) reveals simply an array of black 
and white dots. And yet, this global arrangement of dots presents information 
in a way that allows us to readily grasp the message it was meant to convey
(sometimes we may need a little help from the caption).

An image can contain more information than necessary to accomplish its
purpose. If there were a way of identifying this “unnecessary” detail then we
could compress an image by discarding the detail. The message behind an
image can be subjective, and so can the “surplus” information. Avoiding issues of
artistic representation, there do exist some general principles that lead to
acceptable solutions of the compression problem for “everyday” images. They’re
based on observations that the human eye can be rather insensitive to certain
fluctuations in an image, and also quite tolerant of a wide range of
approximations. Within the environment defined by our subjectivity and the physical
characteristics of our visual system, lossy schemes can flourish. The purpose of
the chapter is to explore these “tried and true” principles and some of the lossy
methods that use them.


¹ The phrase compression ratio is used loosely throughout this chapter and generally refers to
some way of comparing the size of the source after it has been compressed to its size before or vice
versa. At times throughout the chapter we will be more precise.

Two important such schemes are discussed toward the end of this chapter: 
the JPEG image compression standard in Section 10.5, and a wavelet technique 
in Section 10.6. Each one is supported upon a linear algebraic structure called a 
transform, i.e., a change of basis. The theme behind compression in the widely
used JPEG standard is based on the observation that local visual information at
high (spatial) frequencies is often not as important in our global interpretation 
of the image as the low frequencies. JPEG tends to suppress this high-frequency 
information and often eliminates it completely. To gain some understanding of 
the JPEG process, we’ll need to know something about the cosine transform 
and how it reveals image information in ways that enable us to decide what to 
keep and what to throw away. Since the cosine transform is an offspring of the 
Fourier transform, we devote a section to the motivation and development of 
the classical discrete Fourier transform and how it can be used as a compression 
device. Along the way we uncover some mathematical structure that is also 
useful in our discussions of wavelets. 

The thrust behind our development and presentation of the Fourier trans- 
form is pragmatic rather than theoretical; in brief, we approach it from a clas- 
sical signal analysis point of view. The function to be transformed will be re- 
garded as a signal in time with the information it contains being composed 
of several key signals of special frequencies. Later, Fourier transforms are 
extended to operate on two-dimensional “signals”, i.e., images, via a general 
method that works also with wavelets. The chapter ends with a section devoted 
to the JPEG image compression method and a section outlining an applied
approach to wavelet compression.²

Many of the exercises in this chapter are designed to fill in details and 
to briefly explore what we think to be interesting topics in themselves, e.g., 
Shannon’s Sampling Theorem 10.2.11 and the Fast Fourier Transform 10.2.12. 
The reader is encouraged to try them. 


² The JPEG 2000 standard is wavelet based. See, for example, [46].
10.1 Transforms


Dividing an image in two and tossing away one of the pieces is a compression 
method that most of us wouldn’t tolerate well. Contrast this with a method that 
defines a detail structure within an image, tagging details with their level of 
importance. When information from an image is presented to us in this way, a 
lossy compression scheme becomes obvious: discard the least important details. 
Both JPEG and wavelet methods basically do this, even though the way in which 
they define detail is different. 

Images have mathematical representations as rectangular arrays of num- 
bers, typically of integers. Each pixel in the image is assigned an integer whose 
value, in some way, represents its color. In this chapter we identify arrays with 
images and images with arrays, often making no distinction between the two. 
Since arrays of numbers can be scaled (each entry multiplied by the same
number) or added together entry by entry without altering their shape, an image
can be thought of as a point in a linear space, that is, as a vector in a vector
space. For example, if an image has m rows and n columns of pixels then we 
can think of it as a member of the vector space of m x n matrices: a point in a 
space of dimension mn. It’s important to note that the images we usually deal 
with in practice are not just arbitrary arrays of numbers corresponding to points 
scattered willy-nilly throughout mn-space. Rather, typical images share certain 
traits which, when regarded as points in mn-space, translate to a group geometry 
susceptible to quick approximation by several linear schemes. This observation 
is at the heart of lossy compression methods based on linear transforms. 

Mathematically, image analysis takes place in linear spaces. As such, we 
have at our disposal all of the processing tools from linear algebra. To use these 
powerful tools effectively, we’ll need to start with a good choice of fundamental 
or basis images (basis arrays). If chosen properly, these basic images can be 
effectively used to describe detail levels within a large class of images. 

The selection of basis images provides insight into the methods discussed in 
this chapter. For example, JPEG chooses them purely from a classical frequency 
content point of view while wavelet techniques attempt to blend “frequency 
content” together with the location of these frequencies in the image.³ Once
these fundamental detail images have been defined we can then resolve a given
image into a linear combination of them and, by examining coefficients (i.e.,
amplitudes), weigh the importance of a particular detail image to the entire given
image.
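
To make the idea concrete, here is a small sketch that resolves a 2 × 2 image into a linear combination of four basis images and then discards the least important coefficients. The basis below (an average plus three difference patterns) is our own illustrative choice of an orthonormal basis; it is neither the cosine basis used by JPEG nor a wavelet basis from later sections, and all function names are hypothetical.

```python
# Four orthonormal 2 x 2 basis images (an illustrative choice).
BASIS = [
    [[0.5, 0.5], [0.5, 0.5]],      # overall average
    [[0.5, -0.5], [0.5, -0.5]],    # left-right difference
    [[0.5, 0.5], [-0.5, -0.5]],    # top-bottom difference
    [[0.5, -0.5], [-0.5, 0.5]],    # diagonal difference
]


def inner(a, b):
    """Entry-by-entry inner product of two 2 x 2 arrays."""
    return sum(a[i][j] * b[i][j] for i in range(2) for j in range(2))


def analyze(image):
    """Coefficients of the image with respect to BASIS."""
    return [inner(image, b) for b in BASIS]


def synthesize(coeffs):
    """Linear combination of the basis images with the given coefficients."""
    return [[sum(c * b[i][j] for c, b in zip(coeffs, BASIS))
             for j in range(2)] for i in range(2)]


def keep_largest(coeffs, k):
    """Zero all but the k coefficients of largest magnitude."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


image = [[10, 12], [11, 13]]
coeffs = analyze(image)
print(coeffs)                                # [23.0, -2.0, -1.0, 0.0]
print(synthesize(coeffs))                    # [[10.0, 12.0], [11.0, 13.0]]
print(synthesize(keep_largest(coeffs, 2)))   # [[10.5, 12.5], [10.5, 12.5]]
```

Keeping all four coefficients recovers the image exactly; keeping only the two largest gives an approximation that retains the average level and the left-right variation.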

An image is a special kind of signal, or vector, and the resolution process 
above is referred to in linear algebra as a change of basis. At the root of any 
(invertible) linear transformation is a change of basis and corresponding to any
change of basis is a linear transform. All transformations important to us in this 


³ JPEG does this locally, throughout the image.
chapter are linear and so have at their foundation a special set of basis vectors.
The Fourier transform is a well-known example and a starting point for us, but 
before we define it, let’s briefly look at some general reasons for transforming 
information in the first place. 

By a signal, we loosely mean some sort of ordered collection of information
(ordered data) indexed by what will generically be referred to as time (usually 
discrete time). When such a signal is presented to us, we normally wish to do 
something with it; change it somehow, or extract information from it. To do 
the latter it is sometimes necessary to first do the former. For example, in this 
chapter our goal is to compress a signal by discarding some of the information it 
contains. How do we determine what to throw away? We transform the signal 
into a form in which, we hope, important features can be distinguished from 
unimportant and then keep only the important. 

Transform techniques for data analysis have been around for some time and 
scientists often speak of “transforming” their data during data analysis. Here is 
a definition of the word “transform” taken from the dictionary: 


transform: to change in structure, appearance, or character.⁴
A more mathematical definition could read: 


transform: a rule used to exchange one set of objects for another. 
A function from one space to itself or another, i.e., a function. 


Neither definition, by itself, does much to explain how someone goes about ob- 
taining a useful transform. In a practical sense a useful transform will be more 
than just an arbitrary function: it should have some additional properties, e.g., 
perhaps it should be invertible (information preserving) or “easy” to compute. 
Changing variables or coordinates is usually done simply because the new co- 
ordinates turn out to be more convenient to work with than the old. 

As an example, consider the integral $\iint_D e^{x^2+y^2}\,dx\,dy$, where $D$ is the
unit disk in the plane $\mathbb{R}^2$. If the exact value of this integral is the goal then the
standard polar coordinate transformation $x = r\cos\theta$, $y = r\sin\theta$ turns out to be
a good choice:

\[
\iint_D e^{x^2+y^2}\,dx\,dy = \int_0^{2\pi}\!\int_0^1 e^{r^2}\, r\,dr\,d\theta = \pi(e-1).
\]


This is not the sort of transform we have in mind for image processing (for one 
thing, it’s not a linear transformation) but it does the job, and it’s suggested 
by the circular symmetry of the disk D and by the argument to the exponential 
integrand: it is suggested by information contained in the problem. 
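
As a quick numerical sanity check of this value, the sketch below applies a midpoint rule to the radial integral (the angular integral simply contributes a factor of 2π). The function name and the number of subintervals are our own choices.

```python
import math


def polar_integral(n=1000):
    """Midpoint-rule estimate of the integral of e^(r^2) r over 0 <= r <= 1,
    multiplied by 2*pi for the angular part."""
    h = 1.0 / n
    radial = 0.0
    for k in range(n):
        r = (k + 0.5) * h              # midpoint of the k-th subinterval
        radial += math.exp(r * r) * r * h
    return 2 * math.pi * radial


exact = math.pi * (math.e - 1)
print(polar_integral(), exact)
```

The two printed numbers agree to several decimal places, which is consistent with the closed form π(e − 1).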

The polar coordinate transformation is often an appropriate choice for a 
large class of problems exhibiting radial symmetries. An arbitrary change of 
variable would have been unlikely to produce anything useful. A useful change 


⁴ “The Merriam-Webster Dictionary,” 1974.
of variable not only can simplify form but can characterize it in ways that enable
decisions to be made based on the signal’s “new look.” These decisions can be 
of a nature that would have been difficult, maybe impossible, to accurately make 
before the transformation. 

For our purposes a transform will be an invertible linear transformation on 
a single vector space. Mathematically, this is the same as exchanging one set 
of basis vectors of the space for another. Although this restriction eliminates 
polar coordinate and other nonlinear transforms, it is not as confining as may 
first appear. 




10.2 Periodic signals and the Fourier transform 


Consider a basic sine wave $t \mapsto A\sin\omega t$, $t \in \mathbb{R}$. It carries two pieces of
information: the scaling factor $A$ ($|A|$ is known as the amplitude of the wave), and
its oscillation frequency $\omega/2\pi$ (or, equivalently, its period $2\pi/\omega$). These two
bits of information, along with the knowledge that the original signal was a sine
function, allow it to be perfectly reconstructed for all time $t$, i.e., a “sine-wave”
is completely characterized by its frequency and amplitude coefficient.⁵ Not all
signals are so simple. How do we distill down to the essential information that
they contain? It might be nice to have our signals all defined on some common
domain. The sine wave above is defined for all time $t$ whereas most signals we
observe have a finite life. This “defect” can be fixed if we imagine extending
what has been observed over a finite time interval to a periodic function defined
for all time.

In order to analyze periodic signals more complex than a particular sine wave,
we start with perhaps the most fundamental of oscillations:
$\theta \mapsto e^{i\theta}$, $\theta \in \mathbb{R}$. It maps the interval $0 \le \theta < 2\pi$
(or any interval of length $2\pi$) onto the unit circle and is the basic starting
point for classical Fourier analysis. Fix a real number $\omega$ and let
$\theta = \omega t$ (although not necessary, think of $t$ as representing time). This
gives the map $t \mapsto e^{i\omega t}$, an oscillation about the unit circle that
completes precisely one revolution (clockwise if $\omega < 0$ and counterclockwise if
$\omega > 0$) of $2\pi$ radians in $T = 2\pi/|\omega|$ units of time. The constant
$\omega$ can be thought of as angular velocity with units of radians (unit-less) per
unit time. $T$ is the period of the oscillation and the frequency of oscillation is
$f = \omega/2\pi = \pm 1/T$ cycles per second. Thus, a basic oscillation with
frequency $f$ can be described by the function $t \mapsto e^{2\pi i f t}$, $t \in \mathbb{R}$.

[Figure: The Unit Circle]


⁵ We should consider here a phase-shift also, but let’s assume that our sine-waves are zero when
time is zero.
Figure 10.1: A periodic signal.


Imagine a signal, that is, a function $h$ of time $t$. If it’s helpful, you can think
of $h$ as an audio signal, i.e., a voltage level, fluctuating with time. Or scan from
left to right along a horizontal line in a greyscale photograph; in this case $t$ is
not a temporal variable but a spatial measurement and $h(t)$ the shade of grey at
position $t$. We can’t observe a signal forever (even if we can imagine it lasting
that long) so we watch it for a while, say $T > 0$ units of time. By replaying this
piece of signal over and over again we can think of it as defined for all time,
i.e., we regard the portion of the signal sampled as just one period of a period
$T$ function defined on all of $\mathbb{R}$, cf. Figure 10.1. It could be possible that this
period $T$ signal is built up from a few basic, more elementary period $T$ signals.
But before we try to find out exactly what these elementary period $T$ signals are
and how they can be used to synthesize $h$, we’ll first try to make precise our
notion of an elementary signal.

What could be a simpler example of an elementary period $T$ signal than
the oscillation $t \mapsto e^{i\omega t}$ referred to and pictured in the figure of the unit circle
above? If, for each integer $n$, we set $\omega_n = 2\pi n/T = 2\pi f_n$, then the map

\[
t \mapsto e^{i\omega_n t} = e^{2\pi i f_n t}, \qquad t \in \mathbb{R},
\]

completes $n$ trips around the unit circle (counterclockwise if $n$ is positive and
clockwise if $n$ is negative) during the time interval $0 \le t \le T$.
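
The claim about the number of trips can be checked numerically. The sketch below, with hypothetical names `basic_signal` and `trips`, accumulates the small signed angle between successive samples of the signal over one period; it assumes enough samples that each step is well under half a revolution.

```python
import cmath
import math


def basic_signal(n, T):
    """Return the map t -> e^(2 pi i f_n t) with f_n = n / T."""
    f_n = n / T
    return lambda t: cmath.exp(2j * math.pi * f_n * t)


def trips(signal, T, steps=1000):
    """Signed number of revolutions about the origin for 0 <= t <= T."""
    total = 0.0
    prev = signal(0.0)
    for k in range(1, steps + 1):
        cur = signal(T * k / steps)
        total += cmath.phase(cur / prev)   # signed angle of this small step
        prev = cur
    return total / (2 * math.pi)


T = 2.0
for n in (3, -2):
    s = basic_signal(n, T)
    # Number of trips (sign gives the direction) and a periodicity check.
    print(n, round(trips(s, T)), abs(s(0.3 + T) - s(0.3)) < 1e-9)
```

A positive count corresponds to counterclockwise motion and a negative count to clockwise motion.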

Exactly one of these basic signals exists for each integer $n$; that is, for each
$n \in \mathbb{Z}$, $t \mapsto e^{2\pi i f_n t}$ is a signal on $\mathbb{R}$ of period $T$ and frequency $f_n = n/T$. The
collection of all of these fundamental signals

\[
A = \{\, t \mapsto e^{2\pi i f_n t} \mid n \in \mathbb{Z} \,\}
\]


is a set of raw material we can use to generate other period T signals. The 
collection A contains infinitely many different signals and they each have a 
common period T of oscillation. However, even though A is large, it does not 
contain enough functions for our purposes, in that it’s easy to write down a 
signal with period T that does not belong to A. We need a bigger set. It doesn’t 
seem prudent to enlarge A any more than necessary, and since we’re trying to 
generate period T signals there seems to be no reason to add signals other than 
period T signals. One natural way to add more such signals is to combine those 
[Figure: a square wave on one period, together with its period T extension.]


Figure 10.2: A square wave. 


that are already in A. For each complex number $a_n$, the map $t \mapsto a_n e^{2\pi i f_n t}$ is
still an oscillation with period $T$ (frequency $f_n$), it’s just not on the unit circle
any more; instead the motion takes place on a circle of radius $|a_n|$, that is, the
amplitude of the oscillation is $|a_n|$.⁶ Going a step further, take two amplitudes,
$|a_k|$ and $|a_n|$, and two frequencies, $f_k$ and $f_n$, and sum their corresponding
oscillations. This signal

\[
t \mapsto a_k e^{2\pi i f_k t} + a_n e^{2\pi i f_n t}, \qquad t \in \mathbb{R},
\]

has period $T$ and generally differs from anything in A.⁷
To reach other period $T$ signals, combine more than just two oscillations:
select $k$ numbers $a_1, \ldots, a_k$ and $k$ frequencies (basic signals)
$f_{n_1}, \ldots, f_{n_k}$ and form the function
$t \mapsto \sum_{j=1}^{k} a_j e^{2\pi i f_{n_j} t}$. This linear combination of
$e^{2\pi i f_{n_1} t}, \ldots, e^{2\pi i f_{n_k} t}$ from A will always have period $T$
(Exercise 10.2.2). Since A contains
an infinite number of signals we should expect this linear combination
process to generate a tremendous number of new period $T$ functions. It does, but
the resulting set, let’s call it span A, is still not large enough to contain the
signals we might be interested in. For example, extend the map $t \mapsto 1$ on
$0 \le t < T/2$ and $t \mapsto -1$ on $T/2 \le t < T$ to a period $T$ map on $\mathbb{R}$; see
Figure 10.2. This square wave jumps at $0, \pm T/2, \pm T, \ldots$ and does not belong to
span A because span A contains only continuous functions.

However, we can obtain square waves and other discontinuous functions, if 
we allow linear combinations of infinite numbers of oscillations from A. There 
are convergence issues and subtleties associated with infinite series of functions 
that arise when we do this but these are issues we will avoid. In fact, in a few 
paragraphs we’ll be back to considering only finite sums. For the interested 
reader though, good resources dealing with convergence questions and basic 
(Fourier) analysis include [61] and [67]. 


6If $|a_n| = 1$ then the resulting oscillation is still on the unit circle but, unless $a_n = 1$, it starts out
of phase with the rest of the signals in $A$.

7Multiplying basic signals $t \mapsto e^{2\pi i f_k t}$ and $t \mapsto e^{2\pi i f_n t}$ together results in a basic signal, i.e.,
is an element of $A$ (cf. Exercise 10.2.1) and doesn't give us anything new. Scaling a basic signal,
however, is a way to escape the group $A$.

Figure 10.3: A step function approximation to h.


Denote by B the formal collection of all functions generated from A in this 
manner. Thus, 


$$B = \Big\{\, t \mapsto \sum_{n \in \mathbb{Z}} a_n e^{2\pi i f_n t} \;\Big|\; a_n \in \mathbb{C} \,\Big\}.$$


It turns out that $B$ will include just about any period $T$ signal we are likely to
encounter in practice.8 The sequence $n \mapsto a_n$ of coefficients is sometimes called
the Fourier transform of a signal $\sum_{n \in \mathbb{Z}} a_n e^{2\pi i f_n t}$ from $B$; however, we normally
start with a signal $h$ and not its Fourier coefficients $a_n$. We need a recipe for
converting a signal $h$ to its Fourier coefficients, i.e., a way to compute from $h$ a
sequence of coefficients $(a_n)_{n \in \mathbb{Z}}$ so that


$$h(t) = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i f_n t}, \quad 0 \le t \le T. \tag{10.1}$$


Exercise 10.2.4 leads to the simple formula


$$a_n = \frac{1}{T} \int_0^T h(t)\, e^{-2\pi i f_n t}\, dt \tag{10.2}$$


which allows each $a_n$ to be computed directly from $h$. The formula holds when
$h$ is known to have a convergent expansion (10.1).
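
Formula (10.2) can be tried out numerically. The following Python sketch approximates the integral with a midpoint Riemann sum and applies it to the square wave of Figure 10.2 with $T = 1$. The function names, the choice of quadrature rule, and the number of sample points are illustrative assumptions, not part of the text.

```python
import cmath
import math

def fourier_coefficient(h, n, T, samples=2000):
    """Approximate a_n = (1/T) * integral over [0, T] of h(t) * exp(-2*pi*i*f_n*t) dt,
    with f_n = n/T, using a midpoint Riemann sum (an assumed quadrature rule)."""
    dt = T / samples
    total = 0j
    for m in range(samples):
        t = (m + 0.5) * dt                      # midpoint of the m-th subinterval
        total += h(t) * cmath.exp(-2j * math.pi * n * t / T)
    return total * dt / T

def square_wave(t, T=1.0):
    """Square wave of Figure 10.2: +1 on [0, T/2), -1 on [T/2, T), extended with period T."""
    return 1.0 if (t % T) < T / 2 else -1.0

a1 = fourier_coefficient(square_wave, 1, 1.0)   # exact value is -2i/pi
a2 = fourier_coefficient(square_wave, 2, 1.0)   # exact value is 0
print(a1, a2)
```

For this square wave the integral can also be done by hand: $a_n = -2i/(\pi n)$ for odd $n$ and $a_n = 0$ for even $n$, which is what the sketch reproduces to several digits.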

Equations (10.1) and (10.2) are interesting formulae and find their way into
a variety of engineering applications, but we're after something different. For
one thing, in practice we can hardly ever hope to know a signal $h$ for every value
of $t$ in some continuous time interval $0 \le t \le T$. Rather, a signal is typically
measured or sampled at discrete moments $t_n$, $n = 0, \dots, N$, with $0 \le t_n \le T$. It's
quite common to keep the sampling interval constant and $T$ some multiple of it.
The result is a collection of sampling times $t_n$ that are equi-spaced between 0
and $T$. More precisely, if $N$ samples of the signal are desired, then the sampling


8 As long as the signal $h$ is of finite power, i.e., $\int_0^T |h|^2 < \infty$, and provided that we are willing
to relax the condition that equality in (10.1) hold for all $t$ in the interval $0 \le t \le T$ to that of holding
for nearly or almost all $t$ in this interval.

interval is $T/N$ and the sample times are $0, T/N, \dots, (N-1)T/N$. The effect
of this discrete sampling procedure is equivalent to replacing the continuous
signal $h$ with the step function approximation $\tilde{h}$ defined by9


$$\tilde{h}(t) = h(kT/N) \quad \text{for } kT/N \le t < (k+1)T/N \text{ and } k = 0, \dots, N-1.$$


Figure 10.3 contains an illustration of this process and suggests that we should
really be using a discontinuous step function $\tilde{h}$ in place of the ideal signal $h$ in
equations (10.1) and (10.2).

How will the use of $\tilde{h}$ instead of $h$ change things? If $kT/N \le t < (k+1)T/N$,
then $\tilde{h}(t) = h(kT/N)$ and (10.1) becomes


$$\tilde{h}(t) = h(kT/N) = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i f_n kT/N} = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i nk/N}. \tag{10.3}$$


Here is how to get a finite sum from this last expression: since $e^{2\pi i j} = 1$ for
each integer $j$, we have $e^{2\pi i (n + jN)k/N} = e^{2\pi i nk/N}$, so grouping terms and factoring
reduces (10.3) to a finite sum


$$h_k = h(kT/N) = \sum_{n=0}^{N-1} b_n e^{2\pi i nk/N},$$


where $b_n$ is defined from the sequence $(a_n)$ by the relationship $b_n = \sum_{j \in \mathbb{Z}} a_{n + jN}$.
The coefficients $(a_n)_{n \in \mathbb{Z}}$ of (10.3) now fade into the background and we no
longer worry about them; our problem is now finite dimensional.
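
The grouping step can be checked on a small case. The sketch below takes a finitely supported, made-up coefficient sequence $(a_n)$, folds it into $b_n = \sum_j a_{n+jN}$, and compares the sample values produced by the two sums. The dictionary of coefficients and the function names are hypothetical, chosen only for illustration.

```python
import cmath
import math

def fold_coefficients(a, N):
    """Group a finitely supported sequence a (dict: integer n -> a_n) into
    b_n = sum over j of a_{n + jN}, for n = 0, ..., N-1."""
    b = [0j] * N
    for n, a_n in a.items():
        b[n % N] += a_n          # Python's % returns a value in 0..N-1, also for negative n
    return b

def sample_from_a(a, k, N):
    """Sample h_k computed from the full sum over the a_n, as in (10.3)."""
    return sum(a_n * cmath.exp(2j * math.pi * n * k / N) for n, a_n in a.items())

def sample_from_b(b, k, N):
    """Sample h_k computed from the finite sum over b_0, ..., b_{N-1}."""
    return sum(b_n * cmath.exp(2j * math.pi * n * k / N) for n, b_n in enumerate(b))

N = 4
a = {-5: 0.5, -1: 1 - 2j, 0: 3.0, 2: 0.25j, 7: -1.5}   # hypothetical coefficients
b = fold_coefficients(a, N)
max_gap = max(abs(sample_from_a(a, k, N) - sample_from_b(b, k, N)) for k in range(N))
print(b, max_gap)
```

Here the indices $-5$, $-1$, and $7$ all differ from $3$ by a multiple of $N = 4$, so their coefficients are summed into $b_3$; the two ways of computing the samples agree up to rounding.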


Problem. Given a discrete signal $\mathbf{h} = (h_0, \dots, h_{N-1})$, find a vector (of amplitude
coefficients) $\mathbf{b} = (b_0, \dots, b_{N-1})$ so that for each $k = 0, \dots, N-1$


$$h_k = \sum_{n=0}^{N-1} b_n e^{2\pi i nk/N}. \tag{10.4}$$

This is a well-posed mathematical problem and for a given vector (signal) $\mathbf{h}$ we
can solve these $N$ equations for the $N$ unknowns in $\mathbf{b}$ by methods learned in
any linear algebra course, e.g., Gaussian elimination. However, there is more
to system (10.4) than can be seen at first glance because the orthogonality
relationship10


$$\sum_{n=0}^{N-1} e^{2\pi i nk/N}\, e^{-2\pi i nj/N} = \begin{cases} N, & \text{if } k = j, \\ 0, & \text{if } k \ne j, \end{cases} \tag{10.5}$$


developed in Exercise 10.2.5, permits us to easily describe its solution. From


9In practice, the values of this step function are also determined by the number of bits used to
measure or resolve the value of the signal at the moment it is being sampled, a quantization effect
which we shall not be concerned with.

10The word orthogonal is used to generalize the notion of perpendicular and is typically used in
dimensions higher than 3 or when the inner product is different from the usual dot product.

(10.4) and (10.5) one can show (cf. Exercise 10.2.6) that

$$b_k = \frac{1}{N} \sum_{n=0}^{N-1} h_n e^{-2\pi i nk/N}, \quad k = 0, \dots, N-1. \tag{10.6}$$


This equation defines a map $\mathbf{h} \mapsto \mathbf{b}$, from $\mathbb{C}^N$ back into $\mathbb{C}^N$, and the vector $\mathbf{b}$ is
often called the (discrete) Fourier transform of $\mathbf{h}$.

In this text, though, we reserve this title for a slightly modified form of
(10.6), which has more symmetry. Set $\hat{\mathbf{h}} = \sqrt{N}\,\mathbf{b}$ in equations (10.4) and (10.6)
to get the form of what, henceforth, will be called the discrete Fourier transform


$$\hat{h}(\nu) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} h_k e^{-2\pi i k\nu/N}, \quad \nu = 0, \dots, N-1, \tag{10.7}$$

$$h(k) = \frac{1}{\sqrt{N}} \sum_{\nu=0}^{N-1} \hat{h}_\nu e^{2\pi i k\nu/N}, \quad k = 0, \dots, N-1. \tag{10.8}$$


The vector $\hat{\mathbf{h}} = (\hat{h}_0, \dots, \hat{h}_{N-1})$ defined by (10.7) is called the Fourier transform
of $\mathbf{h} = (h_0, \dots, h_{N-1})$.11 The vector $\mathbf{h}$ defined by (10.8) is called the inverse
Fourier transform of $\hat{\mathbf{h}}$.12 The two systems (10.7) and (10.8) enable us to compute
either $\hat{\mathbf{h}}$ from $\mathbf{h}$ or $\mathbf{h}$ from $\hat{\mathbf{h}}$, that is, given one of them, we can compute the
other.
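
As a concrete illustration, here is a minimal Python sketch that evaluates the sums (10.7) and (10.8) term by term. The function names `dft` and `idft` are assumptions made for this example, and the sample vector is the eight-entry signal $\mathbf{x}$ used in the example at the end of this section. The code makes no attempt at efficiency.

```python
import cmath
import math

def dft(h):
    """Discrete Fourier transform as in (10.7): negative exponent, 1/sqrt(N) scaling."""
    N = len(h)
    return [sum(h[k] * cmath.exp(-2j * math.pi * k * v / N) for k in range(N)) / math.sqrt(N)
            for v in range(N)]

def idft(h_hat):
    """Inverse transform as in (10.8): positive exponent, same 1/sqrt(N) scaling."""
    N = len(h_hat)
    return [sum(h_hat[v] * cmath.exp(2j * math.pi * k * v / N) for v in range(N)) / math.sqrt(N)
            for k in range(N)]

x = [0.0, 2.5, 5.0, 12.5, 20.0, 20.0, 27.5, 35.0]
x_hat = dft(x)          # x_hat[0] is about 43.31, x_hat[1] about -5.82 + 17.96i
x_back = idft(x_hat)    # recovers x up to rounding error
print([round(c.real, 2) + round(c.imag, 2) * 1j for c in x_hat])
```

Because both directions carry the factor $1/\sqrt{N}$, applying `idft` after `dft` returns the original samples without any extra rescaling.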

Each of (10.7) and (10.8) defines a linear transformation on $\mathbb{C}^N$ (Exercise 10.2.3),
and hence has a matrix expression. If we let $W$ be the $N \times N$ matrix
whose entry in the $j$th row and $k$th column is $W(j,k) = (1/\sqrt{N})\, e^{2\pi i jk/N}$,
then (10.7) and (10.8) assume simple forms:


$$\hat{\mathbf{h}} = \overline{W}\, \mathbf{h} \tag{10.7'}$$
$$\mathbf{h} = W \hat{\mathbf{h}}. \tag{10.8'}$$


$\overline{W}$ is notation for the matrix whose entries are just the complex conjugates of
the entries in $W$. It's worth noting that (10.7’) and (10.8’) together imply that
$\mathbf{h} = W \overline{W} \mathbf{h}$ for all $\mathbf{h} \in \mathbb{C}^N$ or that $\hat{\mathbf{h}} = \overline{W} W \hat{\mathbf{h}}$ for each $\hat{\mathbf{h}} \in \mathbb{C}^N$; either implies
that $W^{-1} = \overline{W}$. This observation, together with symmetry of $W$, allows for an
easy proof showing that the Fourier transform is an isometry on $\mathbb{C}^N$, i.e., a map
from $\mathbb{C}^N \to \mathbb{C}^N$ that preserves lengths of vectors (Exercises 10.2.7 and 10.2.8
have the details).
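
The two matrix facts just stated can be checked numerically for a small $N$. The sketch below builds $W$ for $N = 8$ with plain Python lists, confirms that $W \overline{W}$ is the identity up to rounding, and compares the length of a sample vector with the length of its transform. All helper names are assumptions made for this illustration, and the check is numerical evidence rather than a proof.

```python
import cmath
import math

def fourier_matrix(N):
    """W(j, k) = exp(2*pi*i*j*k/N) / sqrt(N) for 0 <= j, k <= N-1."""
    return [[cmath.exp(2j * math.pi * j * k / N) / math.sqrt(N) for k in range(N)]
            for j in range(N)]

def conjugate(M):
    """Entrywise complex conjugate of a matrix."""
    return [[entry.conjugate() for entry in row] for row in M]

def matmul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, v):
    return [sum(a * c for a, c in zip(row, v)) for row in A]

def length(v):
    return math.sqrt(sum(abs(c) ** 2 for c in v))

N = 8
W = fourier_matrix(N)
product = matmul(W, conjugate(W))
identity_gap = max(abs(product[i][j] - (1 if i == j else 0))
                   for i in range(N) for j in range(N))

h = [0.0, 2.5, 5.0, 12.5, 20.0, 20.0, 27.5, 35.0]
h_hat = matvec(conjugate(W), h)          # (10.7'): transform = conj(W) times h
print(identity_gap, length(h), length(h_hat))
```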

There is another way to regard the Fourier transform, one which has the 
advantage of enabling us to identify vectors as the frequency elements that compose
the signal. Let $W_k$ denote the $k$th column vector of the matrix $W$, i.e., for


11In reference to sequences and vectors, we'll interchangeably use the notation $x(k)$ and $x_k$ to
denote the same $k$th entry of the sequence or vector $\mathbf{x}$.

12Be warned: some sources may refer to (10.8) as the Fourier transform and (10.7) as the inverse
transform.

$k = 0, \dots, N-1$ put


$$W_k = \frac{1}{\sqrt{N}} \begin{pmatrix} 1 \\ e^{2\pi i k/N} \\ \vdots \\ e^{2\pi i jk/N} \\ \vdots \\ e^{2\pi i (N-1)k/N} \end{pmatrix}. \tag{10.9}$$


Equations (10.7) and (10.8) can now be written as


$$\hat{\mathbf{h}} = \sum_{k=0}^{N-1} h(k)\, \overline{W_k} \tag{10.7''}$$


$$\mathbf{h} = \sum_{\nu=0}^{N-1} \hat{h}(\nu)\, W_\nu. \tag{10.8''}$$


The columns of $W$, then, are the basic signals or elements associated with the
Fourier transform. From a linear algebra perspective, the columns of $W$ form
a basis for the vector space $\mathbb{C}^N$ and the Fourier transform $\hat{\mathbf{h}}$ of $\mathbf{h}$ is just the
(ordered) collection of coefficients needed to expand $\mathbf{h}$ with this basis.

Note that the $j$th component of the basis vector $W_k$ is just


$$W_k(j) = \frac{1}{\sqrt{N}}\, e^{2\pi i jk/N}$$


and, hence, the entries in the $k$th column of $W$ are generated by taking successive
powers of $e^{2\pi i k/N}$. The column (vector) $W_k$, like any vector, is just a
function of its index $j$,13 and since the right-hand side of the above equation
makes sense for any integer $j$, it provides a natural extension of $W_k$ from
$j = 0, 1, \dots, N-1$ to all of $\mathbb{Z}$. Also, since $(e^{2\pi i k/N})^{j+N} = (e^{2\pi i k/N})^{j}$ for all
integers $j$, the extension is a sequence with period $N$.

When regarded as a function on $\mathbb{Z}$, each "column" $W_k$ oscillates with period $N$
and, because time is measured discretely, the frequency of oscillation
increases as the argument $2\pi k/N$ gets closer to $\pi$, that is, when $k \approx N/2$.14
Consequently, high frequency oscillations correspond to columns at the "middle"
of the matrix $W$, i.e., $W_k$ with $k$ at or near $N/2$. The columns on the
left and right of $W$ oscillate at lower frequencies. Also, the orthogonality
relationships (10.5) imply the orthogonality of the vectors $W_k$, $k = 0, \dots, N-1$.
Since each $W_k$ has unit length, the collection $\{W_k \mid k = 0, \dots, N-1\}$ is an
orthonormal basis for $\mathbb{C}^N$,15 cf. Exercise 10.2.9.


13 An $N$-vector $\mathbf{v}$ is a function of its index: $\mathbf{v} : \{0, 1, \dots, N-1\} \to \mathbb{C}$ with $v(k) = v_k$.

14 For each $k$, $(e^{2\pi i k/N})^{j}$ is a point on the unit circle. The map $j \mapsto (e^{2\pi i k/N})^{j}$ defines a
sequence of points that march around the unit circle with time $j$. To see that these points march
"faster" when $k \approx N/2$ consider the following argument. For any real number $\Theta$, there is a unique
$\theta$ with $-\pi \le \theta < \pi$ and $e^{i\theta} = e^{i\Theta}$ (in fact, if you like, you can write $\theta = ((\Theta + \pi) \bmod 2\pi) - \pi$). If
$e^{i\theta}$ is close to 1, then $\theta$ must be close to 0 and the point $(e^{i\theta})^{j} = e^{ij\theta}$ marches slowly around the
unit circle as $j$ ranges over the integers. At the other extreme, if $e^{i\theta}$ is close to $-1$, then $\theta$ is close
to either $\pi$ or $-\pi$ and the argument $j\theta$ changes more dramatically with $j$; the net effect is that the
point $e^{ij\theta}$ jumps rapidly around the unit circle as $j$ clicks from one integer to the next.

Figure 10.4: Real and imaginary parts of partial sum approximations to x.

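We end this section with a quick example of a vector $\mathbf{x}$ and its Fourier
transform $\hat{\mathbf{x}}$:

$$\mathbf{x} = \begin{pmatrix} 0.0 \\ 2.5 \\ 5.0 \\ 12.5 \\ 20.0 \\ 20.0 \\ 27.5 \\ 35.0 \end{pmatrix}, \qquad
\hat{\mathbf{x}} \approx \begin{pmatrix} 43.31 \\ -5.82 + 17.96i \\ -4.42 + 8.84i \\ -8.32 + 2.05i \\ -6.19 \\ -8.32 - 2.05i \\ -4.42 - 8.84i \\ -5.82 - 17.96i \end{pmatrix}.$$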


Figure 10.4 contains some of the partial sums from the Fourier expansion (10.8”) 
of x (split into real and imaginary parts). The solid piecewise linear curve 
is a graph of x, the dotted curve is obtained from the first 2 terms from the 
sum (10.8”), the dash-dots curve from the first 4, and the dashed line from the 
first 6. 


10.2.1 The Fourier transform and compression: an example 


The Fourier transform $\hat{\mathbf{x}}$ of a signal $\mathbf{x}$ is a vector containing the amplitudes of
the fundamental frequencies that make up $\mathbf{x}$. Each component of $\hat{\mathbf{x}}$ indicates
the strength of a particular frequency in x. In certain classes of signals, e.g., 


15Unless otherwise stated, we take the usual inner product in $\mathbb{C}^N$: $\langle \mathbf{v}, \mathbf{w} \rangle = \sum_{k=0}^{N-1} v_k \overline{w_k}$ for
vectors $\mathbf{v} = (v_0, \dots, v_{N-1})$ and $\mathbf{w} = (w_0, \dots, w_{N-1})$ in $\mathbb{C}^N$. A collection of vectors is orthonormal
if the vectors are mutually orthogonal and have unit length.

audio signals, entire frequency ranges may not be relevant or meaningful to
our interpretation of the signal's quality. The Fourier transform gives us direct
control over these frequencies: replacing an entry in $\hat{\mathbf{x}}$ with zero "removes" the
corresponding frequency from $\mathbf{x}$.

The decision to remove frequency information may be based on mathematical
or physical importance of the frequencies. Coefficients in $\hat{\mathbf{x}}$ of small
magnitude indicate frequencies with weak mathematical presence in x, and dis- 
carding them may be done with relative impunity. In applications, there may be 
physical considerations which allow the suppression of certain frequency infor- 
mation, even if these frequencies have significant mathematical presence. For 
example, if very high frequencies are suppressed in an audio signal, then the 
signal changes, but we may not be aware of it.16 The approximation obtained
by zeroing certain coefficients is a special case of quantizing, a method which 
reduces the precision of coefficients, and which will be discussed in connection 
with JPEG in Section 10.5. Thoughtful quantizing can help suppress both non- 
meaningful and weak mathematical frequencies simultaneously. A similar story 
holds for wavelet transforms and the process of selectively eliminating or ap- 
proximating transform coefficients provides a foundation for the lossy schemes 
discussed in this book. 

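Let's examine the action of the Fourier transform on the two signals

$$\mathbf{x} = \begin{pmatrix} 0.0 \\ 2.5 \\ 5.0 \\ 12.5 \\ 20.0 \\ 20.0 \\ 27.5 \\ 35.0 \end{pmatrix}
\quad \text{and} \quad
\mathbf{y} = \begin{pmatrix} 20.0 \\ 14.2 \\ 0.0 \\ -10.0 \\ -15.0 \\ -10.0 \\ 0.0 \\ 14.2 \end{pmatrix}.$$

Think of $\mathbf{x}$ and $\mathbf{y}$ as single periods of two larger signals whose graphs appear in
Figure 10.5. The Fourier transforms $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are approximately

$$\hat{\mathbf{x}} = \begin{pmatrix} 43.31 \\ -5.82 + 17.96i \\ -4.42 + 8.84i \\ -8.32 + 2.05i \\ -6.19 \\ -8.32 - 2.05i \\ -4.42 - 8.84i \\ -5.82 - 17.96i \end{pmatrix}
\quad \text{and} \quad
\hat{\mathbf{y}} = \begin{pmatrix} 4.74 \\ 24.47 \\ 1.77 \\ 0.27 \\ -1.20 \\ 0.27 \\ 1.77 \\ 24.47 \end{pmatrix}.$$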


One clear difference in these two vectors is that $\hat{\mathbf{y}}$ is real and $\hat{\mathbf{x}}$ isn't. A glance
at (10.7) shows that the Fourier transform generally outputs complex vectors
even if the input vectors are real. So why is $\hat{\mathbf{y}}$ real? Is it just by chance or is
$\mathbf{y}$ special in some way that makes vectors like it have real Fourier transforms?
This question is something we'll return to in the next section, but for now we're
content to emphasize that each entry of $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ measures the amplitude of a
particular frequency component of $\mathbf{x}$ and $\mathbf{y}$, respectively, and that high-frequency
components correspond to middle entries of $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$.

16A dog might be able to notice the change.

Figure 10.5: Signals x and y, and their periodic extensions.

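Before we do additional analysis, some notation is required. If $\mathbf{z} \in \mathbb{C}^N$
is any $N$-vector of complex numbers, then define $\mathrm{Abs}(\mathbf{z})$ to be the vector of
absolute values or magnitudes of $\mathbf{z}$. Thus, the $k$th component $\mathrm{Abs}(\mathbf{z})(k)$ of
$\mathrm{Abs}(\mathbf{z})$ is just $|z(k)|$, that is, $\mathrm{Abs}(\mathbf{z})(k) = |z(k)|$, $k = 0, 1, \dots, N-1$. With this
notation, consider

$$\mathrm{Abs}(\hat{\mathbf{x}}) = \begin{pmatrix} 43.31 \\ 18.88 \\ 9.88 \\ 8.57 \\ 6.19 \\ 8.57 \\ 9.88 \\ 18.88 \end{pmatrix}
\quad \text{and} \quad
\mathrm{Abs}(\hat{\mathbf{y}}) = \begin{pmatrix} 4.74 \\ 24.47 \\ 1.77 \\ 0.27 \\ 1.20 \\ 0.27 \\ 1.77 \\ 24.47 \end{pmatrix}.$$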


From these magnitudes it’s clear that high frequencies are more prevalent in x 
than in y. We could wonder whether this difference was apparent before their 
transforms were taken. Look again to the graphs of x and y in Figure 10.5. High 
frequencies can be “spotted” by looking for abrupt changes in values over small 
changes in (in our case, discrete) time, rather than gentle trends. If our attention 
is fixed to only the part of the graphs over the integers 0, 1,..., 7, then we might 
be led to believe that x is as smooth as, if not smoother than, y. Over just these
eight integers this may be true, but remember, the Fourier transform sees these 
signals as defined on all of Z, not just 0,1,...,7. 

Since each $W_k$ has the same length (unit length in this case), the $k$th
Fourier coefficient of a signal is all we need to determine the importance of its
frequency component in the $W_k$ "direction." If the vectors $W_k$ were of differing
lengths then the relative importance of a particular frequency $W_k$ could not be
reliably determined by examining its Fourier coefficient alone. In this case,
throwing out "small" coefficients, thinking the corresponding frequencies are
unimportant, could be a mistake.17 This is a good argument for normalizing
basis signals.

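Because the high-frequency entries in $\hat{\mathbf{y}}$ are quite small when compared
to other coefficients (particularly when compared to the low-frequency entries)
they may not play much of a role and we could ask what would happen to $\mathbf{y}$ if
we left them out. More specifically, how does it affect $\mathbf{y}$ to set $\hat{y}(2) = \hat{y}(3) =
\hat{y}(4) = \hat{y}(5) = \hat{y}(6) = 0$? Doing this gives a vector, say $\hat{\mathbf{z}}$, where

$$\hat{\mathbf{z}} = (4.74,\ 24.47,\ 0,\ 0,\ 0,\ 0,\ 0,\ 24.47)^{T}.$$

$\hat{\mathbf{z}}$ is not the Fourier transform of $\mathbf{y}$, but it's natural to ask which vector $\mathbf{z}$ has $\hat{\mathbf{z}}$ as
its Fourier transform. $\hat{\mathbf{z}}$ is not too far from $\hat{\mathbf{y}}$ so we could hope that $\mathbf{z}$ is not far
from $\mathbf{y}$. It is easy to compute $\mathbf{z}$ using the inverse Fourier transform, (10.8), on
$\hat{\mathbf{z}}$. Then $\mathbf{z}$ and the entry-by-entry error $\mathbf{z} - \mathbf{y}$ are given by

$$\mathbf{z} \approx \begin{pmatrix} 18.98 \\ 13.91 \\ 1.68 \\ -10.56 \\ -15.63 \\ -10.56 \\ 1.68 \\ 13.91 \end{pmatrix}
\quad \text{and} \quad
\mathbf{z} - \mathbf{y} \approx \begin{pmatrix} -1.02 \\ -0.29 \\ 1.68 \\ -0.56 \\ -0.63 \\ -0.56 \\ 1.68 \\ -0.29 \end{pmatrix}.$$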


Whether or not this error is acceptable depends on the purpose of the original
signal and on how accurately it needs to be known.18 In any event, consider
this: we threw away 5/8 or 62.5% of the components of $\hat{\mathbf{y}}$ and we were able to
invert and get something that appears to be fairly "close" to the original $\mathbf{y}$.
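
The experiment with $\mathbf{y}$ can be replayed in a few lines. The Python sketch below evaluates (10.7) and (10.8) directly, zeroes entries 2 through 6 of the transform, and inverts. The names `dft`, `idft`, and `kept` are assumptions made for this illustration.

```python
import cmath
import math

def dft(h):
    """Transform (10.7): negative exponent, 1/sqrt(N) scaling."""
    N = len(h)
    return [sum(h[k] * cmath.exp(-2j * math.pi * k * v / N) for k in range(N)) / math.sqrt(N)
            for v in range(N)]

def idft(h_hat):
    """Inverse transform (10.8): positive exponent, 1/sqrt(N) scaling."""
    N = len(h_hat)
    return [sum(h_hat[v] * cmath.exp(2j * math.pi * k * v / N) for v in range(N)) / math.sqrt(N)
            for k in range(N)]

y = [20.0, 14.2, 0.0, -10.0, -15.0, -10.0, 0.0, 14.2]
y_hat = dft(y)

kept = {0, 1, 7}                                   # indices of the coefficients we keep
z_hat = [c if v in kept else 0j for v, c in enumerate(y_hat)]

z = [c.real for c in idft(z_hat)]                  # imaginary parts are rounding noise here
error = [zk - yk for zk, yk in zip(z, y)]          # entry-by-entry error z - y
print([round(v, 2) for v in z])
print([round(e, 2) for e in error])
```

Only three of the eight transform entries survive, yet no entry of the reconstruction is off by more than about 1.7.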
Now let's repeat the above procedure on $\mathbf{x}$. In fact, let's keep even more
of $\hat{\mathbf{x}}$ than we did of $\hat{\mathbf{y}}$ by setting just the three smallest components $\hat{x}(3)$, $\hat{x}(4)$,

17For example, in a compression scheme, tossing out an innocuous looking coefficient could
prove dangerous if the associated basis element's magnitude is much larger than some of the others.
More precisely, suppose $\mathbf{u}$ and $\mathbf{v}$ have the same length, say $|\mathbf{u}| = |\mathbf{v}| = 1$, and that $\mathbf{w} = a\mathbf{u} + b\mathbf{v}$. If
$|a| \ll |b|$ then $a/b \approx 0$ and $\mathbf{w} = b[(a/b)\mathbf{u} + \mathbf{v}] \approx b\mathbf{v}$. On the other hand, if $|\mathbf{u}|$ is much larger than
$|\mathbf{v}|$ then, even though $a/b \approx 0$, it could be the case that $|(a/b)\mathbf{u}| \gg 0$. The statement $\mathbf{w} \approx b\mathbf{v}$ could
then be extremely misleading.

18 Often engineers will use the $\ell^2$ or Euclidean norm to measure error between two vectors $\mathbf{v}$ and
$\mathbf{w}$: $\|\mathbf{v} - \mathbf{w}\| = \sqrt{\sum_k |v_k - w_k|^2}$. Since the Fourier transform is an isometry (in this $\ell^2$ sense, see
Exercise 10.2.8), the $\ell^2$-error found after reconstruction of $\mathbf{y}$ from a modification of its transform
will be exactly the same as the error introduced into its transform $\hat{\mathbf{y}}$. This practical feature is shared
by all orthogonal (and unitary) transforms.


© 2003 by CRC Press LLC 




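and $\hat{x}(5)$ to zero (note that these coefficients are not, relative to the rest of the
components, as small as the smallest components of $\hat{\mathbf{y}}$). Call this modification
$\hat{\mathbf{z}}$ again and invert to obtain

$$\mathbf{z} \approx \begin{pmatrix} 8.07 \\ -2.83 \\ 5.74 \\ 15.50 \\ 16.30 \\ 20.95 \\ 31.13 \\ 27.63 \end{pmatrix}
\quad \text{and} \quad
\mathbf{z} - \mathbf{x} \approx \begin{pmatrix} 8.07 \\ -5.33 \\ 0.74 \\ 3.00 \\ -3.70 \\ 0.95 \\ 3.63 \\ -7.37 \end{pmatrix}.$$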


We didn’t do as well in this case even though we altered less frequency informa- 
tion. Does this mean that x cannot be compressed effectively? No, it could just 
mean that the Fourier transform is not the right compression tool to use on x. If 
signals like $\mathbf{x}$ need to be compressed, then perhaps better results could be obtained
by using a different set of basis signals than the Fourier $\{W_0, \dots, W_{N-1}\}$.
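
To put numbers on "we didn't do as well," the two experiments can be compared with the Euclidean norm. The sketch below repeats both truncations and reports the size of the error in the signal and in the transform; because the transform is an isometry, the two sizes agree for each signal. The helper names are assumptions made for this illustration.

```python
import cmath
import math

def dft(h):
    """Transform (10.7)."""
    N = len(h)
    return [sum(h[k] * cmath.exp(-2j * math.pi * k * v / N) for k in range(N)) / math.sqrt(N)
            for v in range(N)]

def idft(h_hat):
    """Inverse transform (10.8)."""
    N = len(h_hat)
    return [sum(h_hat[v] * cmath.exp(2j * math.pi * k * v / N) for v in range(N)) / math.sqrt(N)
            for k in range(N)]

def l2_norm(v):
    return math.sqrt(sum(abs(c) ** 2 for c in v))

def truncation_errors(signal, zeroed):
    """Zero the transform entries listed in `zeroed`, invert, and return
    (error measured in the signal, error measured in the transform)."""
    s_hat = dft(signal)
    z_hat = [0j if v in zeroed else c for v, c in enumerate(s_hat)]
    z = idft(z_hat)
    signal_err = l2_norm([zk - sk for zk, sk in zip(z, signal)])
    transform_err = l2_norm([zc - sc for zc, sc in zip(z_hat, s_hat)])
    return signal_err, transform_err

x = [0.0, 2.5, 5.0, 12.5, 20.0, 20.0, 27.5, 35.0]
y = [20.0, 14.2, 0.0, -10.0, -15.0, -10.0, 0.0, 14.2]

x_err, x_hat_err = truncation_errors(x, {3, 4, 5})          # three entries zeroed
y_err, y_hat_err = truncation_errors(y, {2, 3, 4, 5, 6})    # five entries zeroed
print(round(x_err, 2), round(y_err, 2))
```

With these inputs the error for $\mathbf{x}$ comes out near 13.6 and the error for $\mathbf{y}$ near 2.8, even though fewer coefficients of $\hat{\mathbf{x}}$ were discarded.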

Just how to construct such a “compression” basis can be a problem that is 
not easily solved. The above strategy can be thought of as a “projection” scheme 
in the sense that setting Fourier coefficients to zero projects the transform vector 
(orthogonally) into a subspace of smaller dimension (dimension 3 in the case 
of $\hat{\mathbf{y}}$, 5 in the case of $\hat{\mathbf{x}}$). Projection methods can work if the subset of the data
type space from which we will select vectors to compress is "thin" in several
directions.19 In this setting, the job of choosing basis elements amounts to
finding these special “directions.” 

The following example may better illustrate our meaning here. Consider
the set

$$G = \left\{ \begin{pmatrix} \alpha & a_{12} & a_{13} \\ a_{21} & \alpha & a_{23} \\ a_{31} & a_{32} & \beta \end{pmatrix} : \alpha, \beta > 1,\ |a_{ij}| \ll 1 \text{ if } i \ne j \right\}.$$

$G$ is a subset of the (9-dimensional) space of 3 by 3 matrices and can be
described "well" using only two matrices:

$$M_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad
M_2 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

To fill out a basis, select seven more 3 by 3 arrays having directions as different
as possible from $M_1$ and $M_2$, i.e., choose them orthogonal to the span of $M_1$
and $M_2$.20 The nine arrays will form a basis for the space of 3 by 3 matrices.
With respect to this basis, the subset $G$ is "thin" in any basis direction other
than $M_1$ and $M_2$. One could ask why it has been suggested that the other basis
elements be chosen orthogonal to both $M_1$ and $M_2$; after all, to get a basis they

19 An effect of quantizing can be to turn a "thin" set into a "thinner" discrete set.

20Think of 3 by 3 matrices as vectors in $\mathbb{R}^9$ and use the inner product there to determine orthogonality.
Also, see Exercise 10.2.10 for more on why these directions are as different as possible from
each other.

only need to be linearly independent of $M_1$, $M_2$ and each other. In answer to
this suppose we take as a basis element an array that is not too different from
$M_2$ (and definitely not orthogonal to it), e.g.,

$$M = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0.001 & 0 & 1 \end{pmatrix}.$$

$M$ doesn't belong to the span of $\{M_1, M_2\}$ but its close proximity to $M_2$ means
that it plays a significant role in describing any array that also relies on $M_2$,
in particular, in describing arrays in $G$. In other words, $G$ isn't "thin" in the
direction of $M$. To see this more precisely, the matrix

$$A = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0.001 & 0 & 2 \end{pmatrix}$$

certainly belongs to $G$, but since $A = 2M_1 + M_2 + M$, its construction from any
basis containing $M_1$, $M_2$ and $M$ requires the "same amount" of $M$ as $M_2$.21
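
The arithmetic behind this example is easy to confirm. The sketch below rebuilds $A$ from $2M_1 + M_2 + M$ and computes the cosine of the angle between $M$ and $M_2$ when 3 by 3 arrays are treated as vectors in $\mathbb{R}^9$. The helper names `combine` and `inner` are assumptions made for this illustration.

```python
import math

M1 = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
M2 = [[0, 0, 0], [0, 0, 0], [0, 0, 1]]
M = [[0, 0, 0], [0, 0, 0], [0.001, 0, 1]]
A = [[2, 0, 0], [0, 2, 0], [0.001, 0, 2]]

def combine(coeffs, mats):
    """Entrywise linear combination sum_i coeffs[i] * mats[i] of 3 by 3 arrays."""
    return [[sum(c * m[i][j] for c, m in zip(coeffs, mats)) for j in range(3)]
            for i in range(3)]

def inner(P, Q):
    """Inner product of two 3 by 3 arrays viewed as vectors in R^9."""
    return sum(P[i][j] * Q[i][j] for i in range(3) for j in range(3))

rebuilt = combine([2, 1, 1], [M1, M2, M])           # 2*M1 + 1*M2 + 1*M
gap = max(abs(rebuilt[i][j] - A[i][j]) for i in range(3) for j in range(3))

# Cosine of the angle between M and M2: very close to 1, i.e. far from orthogonal.
cos_angle = inner(M, M2) / math.sqrt(inner(M, M) * inner(M2, M2))
print(gap, cos_angle)
```

The coefficients of $M_2$ and $M$ in the combination are both 1, which is the "same amount" the text refers to, and the cosine near 1 shows how nearly parallel the two arrays are.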

In summary, we were able to suppress much of the information contained 
in y (setting 5 of the 8 coefficients to zero), still having something whose inverse 
transform looked like the original y. The Fourier transform may not have been 
the best tool to use in compressing x. We can’t dismiss the Fourier transform 
so easily though, and interestingly enough it will provide us with a better tool 
to use on vectors like x. We just need to better understand the way in which 
the Fourier transform looks at signals as single periods of larger signals defined 
for all (discrete) time. In this context, the Fourier transform will also aid us in 
understanding why $\hat{\mathbf{y}}$ is real and $\hat{\mathbf{x}}$ is not. In the next section we pursue this
thread, leading us to the cosine transform. 


Exercises 10.2 


Throughout these exercises, $T$ is a nonzero real number and, for each $n \in \mathbb{Z}$,
$f_n := n/T$.
1. The collection $G = \{ e^{2\pi i f_n t} \mid t \in \mathbb{R},\ n \in \mathbb{Z} \}$ forms a (commutative) group
under pointwise multiplication.22 To check this, show the following:


(a) $e^{2\pi i f_n t}\, e^{2\pi i f_m t} = e^{2\pi i f_{n+m} t}$; multiplication of two elementary signals is
an elementary signal,


21 Both $M$ and $M_2$ are about the same magnitude so their coefficients are comparable.
22A group is a basic algebraic structure. It consists of a set $G$ together with a multiplication on
$G$ such that
(a) $ab \in G$ whenever $a$ and $b$ are in $G$,
(b) $a(bc) = (ab)c$ whenever $a$, $b$, and $c$ belong to $G$,
(c) there is an identity element $e$ in $G$, i.e., there exists $e \in G$ such that $ae = ea = a$ for every
$a \in G$,
(d) each element in $G$ has an inverse in $G$, i.e., for each element $a \in G$ there is an element $a^{-1} \in G$
with $aa^{-1} = a^{-1}a = e$.

(b) $e^{2\pi i f_0 t} = 1$; the identity element $t \mapsto 1$ is an elementary signal, and
(c) $e^{2\pi i f_n t}\, e^{2\pi i f_{-n} t} = 1$; each elementary signal has an inverse.


To reach functions outside of G requires some operation other than just 
multiplying its elements together. 


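2. (a) If $a_1, \dots, a_N$ are any $N$ complex numbers and $n_1, \dots, n_N$ any $N$ integers,
then show that the map $t \mapsto \sum_{k=1}^{N} a_k e^{2\pi i f_{n_k} t}$ has period $T$.

(b) If $a_n \in \mathbb{C}$ for each $n \in \mathbb{Z}$ then (formally) show that the map
$t \mapsto \sum_{n \in \mathbb{Z}} a_n e^{2\pi i f_n t}$ still has period $T$.


3. Show that the Fourier transform map $\hat{\ } : \mathbb{C}^N \to \mathbb{C}^N$ defined by (10.7) is
a linear map; that is, show that $\widehat{\alpha \mathbf{g} + \beta \mathbf{h}} = \alpha \hat{\mathbf{g}} + \beta \hat{\mathbf{h}}$ for any $\alpha, \beta \in \mathbb{C}$ and
$\mathbf{g}, \mathbf{h} \in \mathbb{C}^N$.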


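4. (a) Show that the functions $e^{2\pi i f_n t}$ are orthogonal over the interval $0 \le t \le T$;
that is, show that

$$\int_0^T e^{2\pi i f_n t}\, e^{-2\pi i f_m t}\, dt = \begin{cases} T, & \text{if } m = n, \\ 0, & \text{if } m \ne n. \end{cases}$$

(b) Now suppose that $h(t) = \sum_{k \in \mathbb{Z}} a_k e^{2\pi i f_k t}$ for $0 \le t \le T$. Multiply
each side by $e^{-2\pi i f_n t}$ and then integrate both sides over the interval
$0 \le t \le T$. Use part (a) above to conclude that

$$a_n = \frac{1}{T} \int_0^T h(t)\, e^{-2\pi i f_n t}\, dt.$$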


5. Equation (10.5)

$$\sum_{n=0}^{N-1} e^{2\pi i nk/N}\, e^{-2\pi i nj/N} = \begin{cases} N, & \text{if } j = k, \\ 0, & \text{if } j \ne k \end{cases}$$

is a discrete version of the orthogonality relationship in Exercise 4(a) above.
To see this, fix a positive integer $N$ and, for each $k \in \mathbb{Z}$, put $z_k = e^{2\pi i k/N}$. Now do the
following:


(a) Show that $z = z_k$ is a solution to the equation $z^N - 1 = 0$. (The $N$
distinct complex numbers $z_0, \dots, z_{N-1}$ are known as the $N$th roots of
unity.)

(b) Plot $z_k$, for $k = 0, \dots, N-1$, for several values of $N$, say $N = 2, 3, 4,$
and $8$.

(c) Argue that, if $z_k \ne 1$, then $\sum_{m=0}^{N-1} (z_k)^m = 0$; i.e., that $\sum_{m=0}^{N-1} e^{2\pi i mk/N} = 0$.
Hint: The expression $z^N - 1$ can be factored:
$z^N - 1 = (z - 1)(1 + z + z^2 + \cdots + z^{N-1})$.

(d) Now prove the orthogonality relationship above.




6. Starting with equation (10.4) and the orthogonality relationship (10.5) show
that equation (10.6) follows, i.e., show that

$$b_k = \frac{1}{N} \sum_{n=0}^{N-1} h_n e^{-2\pi i nk/N}, \quad k = 0, \dots, N-1,$$

whenever $h_k = \sum_{n=0}^{N-1} b_n e^{2\pi i nk/N}$.
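7. Let $W(j,k) = (1/\sqrt{N})\, e^{2\pi i jk/N}$ for $0 \le j, k \le N-1$ and show that the pair
of equations (10.7) and (10.8) can be written as the pair (10.7’) and (10.8’).
8. If $\mathbf{v} = (v_0, \dots, v_{N-1})$ and $\mathbf{w} = (w_0, \dots, w_{N-1})$ are vectors in $\mathbb{C}^N$ then we
define their inner product $\langle \mathbf{v}, \mathbf{w} \rangle$ to be

$$\langle \mathbf{v}, \mathbf{w} \rangle = \sum_{k=0}^{N-1} v(k)\, \overline{w(k)}.$$

The length $|\mathbf{v}|$ of a vector $\mathbf{v}$ is defined in the usual way

$$|\mathbf{v}| = \sqrt{\sum_{k=0}^{N-1} |v(k)|^2}.$$

(a) If $\mathbf{v} \in \mathbb{C}^N$ then show that $\langle \mathbf{v}, \mathbf{v} \rangle$ is always a nonnegative real number
and that $|\mathbf{v}| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$.

(b) If $A$ is an $N \times N$ matrix of complex numbers then show that $\langle A\mathbf{v}, \mathbf{w} \rangle =
\langle \mathbf{v}, A^{*}\mathbf{w} \rangle$ where the matrix $A^{*}$ is the conjugate transpose of $A$.

(c) Using equations (10.7’), (10.8’) and parts (a) and (b) above, argue that
$|\hat{\mathbf{h}}| = |\mathbf{h}|$ for any $\mathbf{h} \in \mathbb{C}^N$. This shows that the Fourier transform is an
isometric automorphism on $\mathbb{C}^N$.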


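9. Use the orthogonality relationship (10.5) or the relationship $W^{-1} = \overline{W}$
(a consequence of (10.7’) and (10.8’)) to show that the columns $W_k$, $k = 0, \dots, N-1$, of the matrix
$W$ exhibit the following property

$$\langle W_j, W_k \rangle = \begin{cases} 1, & \text{if } j = k, \\ 0, & \text{if } j \ne k. \end{cases}$$

Thus, the columns of $W$ form an orthonormal set of vectors in $\mathbb{C}^N$. What
about the rows of $W$?


10. Let $\mathbf{u}$ and $\mathbf{v}$ be orthogonal unit vectors in $\mathbb{C}^N$. Suppose that $\mathbf{w} \in \mathbb{C}^N$ is
another unit vector not orthogonal to $\mathbf{u}$. Then show that there is a complex
number $\alpha$, with $|\alpha| = 1$, such that $\|\alpha \mathbf{u} - \mathbf{w}\| < \|\mathbf{u} - \mathbf{v}\|$.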


Remark: The exercise shows that, direction-wise, orthogonal vectors are
further from each other than nonorthogonal vectors.

11. Shannon's sampling theorem. Suppose that $h$ is a continuous signal defined
on the interval $[-T/2, T/2]$ of length $T$. Then, from equations (10.1)
and (10.2)

$$h(t) = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i f_n t} \tag{10.10}$$

where $f_n = n/T$ and

$$a_n = \frac{1}{T} \int_{-T/2}^{T/2} h(t)\, e^{-2\pi i f_n t}\, dt. \tag{10.11}$$

The sequence of coefficients, $(a_n)_{n \in \mathbb{Z}}$, is called the Fourier transform $\hat{h}$ of
$h$, i.e., $\hat{h}(n) = a_n$. If, instead of being defined on some finite interval, $h$ is
defined on all of $\mathbb{R}$ then it may still have a Fourier transform.23 Generally,
to synthesize such a signal we need to use basic signals of all frequencies

$$h(t) = \int_{\mathbb{R}} \hat{h}(f)\, e^{2\pi i f t}\, df, \tag{10.12}$$

$$\hat{h}(f) = \int_{\mathbb{R}} h(t)\, e^{-2\pi i f t}\, dt. \tag{10.13}$$

A signal $h : \mathbb{R} \to \mathbb{R}$ is called "band-limited" if for some frequency $f_c$, its
Fourier transform $\hat{h}(f) = 0$ whenever $|f| > f_c$, that is, if $h$ is composed of
only a finite range of frequencies.


(a) If $\hat{h}(f) = 0$ for $|f| > f_c$, then use (10.12) to show that

$$h(t) = \int_{-f_c}^{f_c} \hat{h}(f)\, e^{2\pi i f t}\, df. \tag{10.14}$$

Use (10.10) and (10.11), with $\hat{h}$ in place of $h$, to show that

$$\hat{h}(f) = \sum_{n \in \mathbb{Z}} c_n e^{2\pi i n f/(2 f_c)}, \tag{10.15}$$

where

$$c_n = \frac{1}{2 f_c} \int_{-f_c}^{f_c} \hat{h}(f)\, e^{-2\pi i n f/(2 f_c)}\, df. \tag{10.16}$$


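23 It turns out that $h \in L^2[-T/2, T/2]$ (i.e., $\int_{-T/2}^{T/2} |h|^2 < \infty$) if and only if there is a sequence
$(a_n)_{n \in \mathbb{Z}} \in \ell^2$ (i.e., $\sum_{n \in \mathbb{Z}} |a_n|^2 < \infty$) and (10.10) holds. In this case the sequence $(a_n)_{n \in \mathbb{Z}}$ (the
Fourier transform of $h$) is given by (10.11) and (Parseval's theorem) $\sum_{n \in \mathbb{Z}} |a_n|^2 = \frac{1}{T} \int_{-T/2}^{T/2} |h|^2$, i.e.,
$\|(a_n)_{n \in \mathbb{Z}}\|_{\ell^2} = \|h\|_{L^2[-T/2, T/2]}$ when the $L^2$ norm is taken with respect to the normalized measure $dt/T$.
Similarly, $h \in L^2(\mathbb{R})$ (i.e., $\int_{\mathbb{R}} |h|^2 < \infty$) if and only if there exists
a function $\hat{h}$ also in $L^2(\mathbb{R})$ such that (10.12) holds. In this case the function $\hat{h}$ (the Fourier transform
of $h$) is given by (10.13) and (the Plancherel theorem) $\int_{\mathbb{R}} |h|^2 = \int_{\mathbb{R}} |\hat{h}|^2$, i.e., $\|h\|_{L^2(\mathbb{R})} = \|\hat{h}\|_{L^2(\mathbb{R})}$.

(b) Now argue from (10.14) and (10.16) that

$$h\!\left(-\frac{n}{2 f_c}\right) = 2 f_c\, c_n$$

and, hence, from (10.15)

$$\hat{h}(f) = \frac{1}{2 f_c} \sum_{n \in \mathbb{Z}} h\!\left(-\frac{n}{2 f_c}\right) e^{2\pi i n f/(2 f_c)}
= \frac{1}{2 f_c} \sum_{n \in \mathbb{Z}} h\!\left(\frac{n}{2 f_c}\right) e^{-2\pi i n f/(2 f_c)}.$$

(c) Use this last equation in (10.14) and integrate term by term to get

$$h(t) = \frac{1}{2 f_c} \sum_{n \in \mathbb{Z}} h\!\left(\frac{n}{2 f_c}\right) \frac{\sin 2\pi f_c \left(t - \frac{n}{2 f_c}\right)}{\pi \left(t - \frac{n}{2 f_c}\right)}.$$

(d) Finally, letting $\Delta = 1/(2 f_c)$ in the above sum, show that

$$h(t) = \Delta \sum_{n \in \mathbb{Z}} h(n\Delta)\, \frac{\sin 2\pi f_c (t - n\Delta)}{\pi (t - n\Delta)}.$$

This last equation is known as Shannon's sampling theorem. It allows
the reconstruction of a continuous signal $h$ everywhere if its values are
sampled at a rate at least twice as frequently as the critical value $f_c$.
The cut-off frequency $f_c$ is called a Nyquist critical frequency and $\Delta$
the corresponding sampling interval.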
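
The reconstruction formula in part (d) can be tried on a signal known to be band-limited. The Python sketch below uses $h(t) = (\sin \pi t / \pi t)^2$, whose Fourier transform is a triangle supported on $[-1, 1]$ (a standard fact, taken as an assumption here), so $f_c = 1$ and $\Delta = 1/2$. The infinite sum is truncated to finitely many terms, which is an approximation introduced for this illustration; the function names are hypothetical.

```python
import math

def shannon_reconstruct(sample, t, fc, terms=200):
    """Truncated version of h(t) = Delta * sum_n h(n*Delta) * sin(2*pi*fc*(t - n*Delta)) / (pi*(t - n*Delta)),
    with Delta = 1/(2*fc) and n running from -terms to terms."""
    delta = 1.0 / (2.0 * fc)
    total = 0.0
    for n in range(-terms, terms + 1):
        u = t - n * delta
        if abs(u) < 1e-12:
            kernel = 2.0 * fc                     # limit of sin(2*pi*fc*u)/(pi*u) as u -> 0
        else:
            kernel = math.sin(2.0 * math.pi * fc * u) / (math.pi * u)
        total += sample(n * delta) * kernel
    return delta * total

def h(t):
    """Test signal (sin(pi t)/(pi t))^2, assumed band-limited with fc = 1."""
    if t == 0.0:
        return 1.0
    s = math.sin(math.pi * t) / (math.pi * t)
    return s * s

approx = shannon_reconstruct(h, 0.3, fc=1.0)      # uses only the samples h(n/2)
exact = h(0.3)
print(approx, exact)
```

The value at $t = 0.3$ is never sampled, yet the truncated sum recovers it closely. Signals whose samples decay more slowly would need many more terms for the same accuracy.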


12. A fast Fourier transform. To compute the Fourier transform $\hat{\mathbf{h}}$ directly from
its definition

$$\hat{h}(\nu) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} h_k e^{-2\pi i k\nu/N}, \quad \nu = 0, \dots, N-1 \tag{10.17}$$

requires basically $N^2$ add-multiply operations (you should count them!).
Fast Fourier transforms (FFTs) are attempts to speed up this computation
by more efficiently handling arithmetic.


In parts (a), (b), and (c) of this problem, we'll assume that $N$ can be factored
$N = p_1 p_2$.24


(a) Convince yourself that each of the indices $k$ and $\nu$ can be expressed in
the following forms:


$k = k_1 p_1 + k_0$; for $k_0 = 0, \dots, p_1 - 1$ and $k_1 = 0, \dots, p_2 - 1$
$\nu = \nu_1 p_2 + \nu_0$; for $\nu_0 = 0, \dots, p_2 - 1$ and $\nu_1 = 0, \dots, p_1 - 1$.

24By the end of the exercise we hope to see that even if the signal length $N$ is prime it could be
advantageous to pad its length, to say a power of 2, with zeros and use an FFT.

Consequently, the transform equation (10.17) can be written as

$$\hat{h}(\nu) = \frac{1}{\sqrt{N}} \sum_{k_1=0}^{p_2-1} \sum_{k_0=0}^{p_1-1} h_{k_1 p_1 + k_0}\, e^{-2\pi i \frac{(k_1 p_1 + k_0)\nu}{N}}. \tag{10.18}$$


(b) Now argue that

$$e^{-2\pi i \frac{(k_1 p_1 + k_0)\nu}{N}} = e^{-2\pi i \frac{\nu k_0}{N}}\, e^{-2\pi i \frac{\nu_0 k_1 p_1}{N}}$$

and hence, from (10.18)

$$\hat{h}(\nu) = \frac{1}{\sqrt{p_1}} \sum_{k_0=0}^{p_1-1} e^{-2\pi i \frac{\nu k_0}{N}}\, \tilde{h}(k_0, \nu_0)$$

where

$$\tilde{h}(x, y) = \frac{1}{\sqrt{p_2}} \sum_{k_1=0}^{p_2-1} h_{k_1 p_1 + x}\, e^{-2\pi i \frac{y k_1 p_1}{N}}$$

for $x = 0, 1, \dots, p_1 - 1$ and $y = 0, 1, \dots, p_2 - 1$.

(c) Argue that there are exactly $N$ different $\tilde{h}(x, y)$ and each one takes $p_2$
add-multiplies to compute. To compute them all requires $N p_2$ add-multiplies.
After they are computed we can go about computing the
$\hat{h}(\nu)$. Convince yourself that now $p_1$ add-multiplies are needed to
compute each $\hat{h}(\nu)$. Thus, the total number of add-multiply operations
required to compute $\hat{h}(\nu)$, $\nu = 0, 1, \dots, N-1$, in this manner is
$N p_1 + N p_2 = N(p_1 + p_2)$. Part (d) compares this number with the $N^2$
operations required of (10.17) when it is used directly.


(d) This process can be repeated: if $N = p_1 p_2 \cdots p_j$ then the total add-multiplies
will be $N(p_1 + p_2 + \cdots + p_j)$. Now take the special case
that $p_1 = p_2 = \cdots = p_j = p$ so that $N = p^j$. Show that the total
number of add-multiplies is $N j p = p N \log_p N$.
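
The two-stage computation in parts (a) through (c) can be written out directly. The sketch below first tabulates the $N$ inner sums $\tilde{h}(x, y)$ and then assembles each $\hat{h}(\nu)$ from $p_1$ of them, comparing the result with a term-by-term evaluation of (10.17). The function names are assumptions made for this illustration, and the code follows the exercise's bookkeeping rather than any optimized FFT library.

```python
import cmath
import math

def dft_direct(h):
    """Term-by-term evaluation of (10.17): about N^2 add-multiplies."""
    N = len(h)
    return [sum(h[k] * cmath.exp(-2j * math.pi * k * v / N) for k in range(N)) / math.sqrt(N)
            for v in range(N)]

def dft_two_factor(h, p1, p2):
    """Two-stage transform for N = p1 * p2: about N * (p1 + p2) add-multiplies."""
    N = p1 * p2
    if len(h) != N:
        raise ValueError("signal length must equal p1 * p2")
    # Stage 1: h_tilde[x][y] = (1/sqrt(p2)) * sum over k1 of h[k1*p1 + x] * exp(-2*pi*i*y*k1*p1/N).
    h_tilde = [[sum(h[k1 * p1 + x] * cmath.exp(-2j * math.pi * y * k1 * p1 / N)
                    for k1 in range(p2)) / math.sqrt(p2)
                for y in range(p2)]
               for x in range(p1)]
    # Stage 2: h_hat[v] = (1/sqrt(p1)) * sum over k0 of exp(-2*pi*i*v*k0/N) * h_tilde[k0][v0],
    # where v0 = v mod p2 is the low-order part of v = v1*p2 + v0.
    return [sum(cmath.exp(-2j * math.pi * v * k0 / N) * h_tilde[k0][v % p2]
                for k0 in range(p1)) / math.sqrt(p1)
            for v in range(N)]

x = [0.0, 2.5, 5.0, 12.5, 20.0, 20.0, 27.5, 35.0]
direct = dft_direct(x)
staged = dft_two_factor(x, 2, 4)
max_gap = max(abs(a - b) for a, b in zip(direct, staged))
print(max_gap)
```

For $N = 8$ with $p_1 = 2$ and $p_2 = 4$ the operation count drops from $64$ to $8(2 + 4) = 48$; the savings become substantial only when the factoring is repeated, as in part (d).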


The computational savings using an FFT can be considerable. To get some
idea of how much faster the FFT can be, take the case $p = 2$ and look at
the ratio of the number of add-multiplies using the FFT ($2N \log_2 N$) to the
number $N^2$: $2N \log_2 N / N^2 = 2 \log_2 N / N$. In a signal with $N = 2^{10} = 1024$
(a relatively short signal) this ratio is

$$\frac{2 \log_2 2^{10}}{2^{10}} = \frac{20}{2^{10}} \approx \frac{1}{50}.$$

The FFT transforms this signal with about 50 times fewer operations than
the direct use of (10.17). Approximate the speed increase for a signal with
$N = 2^{20}$.

Figure 10.6: Even extension of a signal about k = N − 1/2.

10.3 The cosine and sine transforms


The Fourier transform is a map between N-periodic sequences of complex num- 
bers. Figure 10.5 illustrates these extensions and so can help explain the high 
frequencies prevalent in x but not in y. The sequence x, as seen by the Fourier 
transform, makes a considerable jump each time k goes from −9 to −8, −1
to 0, 7 to 8, 15 to 16, etc. The extension contains an artificially introduced
“high-frequency” blip; after transforming x we see this behavior reflected in
significant high-frequency components. On the other hand, y was rigged so that
it didn’t exhibit such large endpoint differences, i.e., compare x(7) − x(0) = 35
to y(7) − y(0) = −5.8. Figure 10.6 suggests the possibility of extending a signal
in a way that avoids introducing a high-frequency blip. 

First start with a signal x, defined at times $k = 0, \ldots, N-1$, and then extend it
to $k = N, N+1, \ldots, 2N-1$ by reflecting its graph across the vertical line that
passes through the horizontal axis at the point $k = N - 1/2$ (see Figure 10.6).
Mathematically, this amounts to the definition

$$x(N + k) := x(N - (k + 1)), \quad k = 0, \ldots, N-1. \tag{10.19}$$


The resulting signal, still call it x, is defined for $k = 0, \ldots, 2N-1$ and extends
the original x symmetrically. It has the property that $x(0) = x(2N-1)$, i.e., the
endpoint values now match. This type of extension is usually called an “even”
extension or, more precisely, an even extension centered at $k = N - 1/2$, and
the Fourier transform sees it as now having period 2N instead of N.
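A minimal sketch of the even extension (10.19), using the sample vector x of Section 10.2.1. The helper name `even_extend` is hypothetical and chosen only for illustration.

```python
def even_extend(x):
    """Extend x (length N) to length 2N using x(N + k) = x(N - (k + 1))."""
    n = len(x)
    return list(x) + [x[n - (k + 1)] for k in range(n)]

x = [0.0, 2.5, 5.0, 12.5, 20.0, 20.0, 27.5, 35.0]
ext = even_extend(x)

# The endpoint values of the extended signal now match.
print(ext[0], ext[-1])
```

The extended list is the original followed by its mirror image, so no jump is introduced where one period meets the next.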

Apply the Fourier transform to this new (period 2N) signal and then use
Euler’s identity $2\cos\theta = e^{i\theta} + e^{-i\theta}$ repeatedly. The result is eventually a linear
combination of cosine functions in place of exponential functions; ergo, the
name “cosine transform.” Details of this process are outlined in Exercise 10.3.3
and result in the following pair of equations:

$$\hat{x}(v) = \sum_{k=0}^{N-1} x(k)\, C(v) \cos\frac{(2k+1)v\pi}{2N}, \quad v = 0, \ldots, N-1, \tag{10.20}$$

$$x(k) = \sum_{v=0}^{N-1} \hat{x}(v)\, C(v) \cos\frac{(2k+1)v\pi}{2N}, \quad k = 0, \ldots, N-1, \tag{10.21}$$


where $C(0) = \sqrt{1/N}$ and $C(k) = \sqrt{2/N}$ if $k \neq 0$. System (10.20) is often
called the forward cosine transform and system (10.21) the backward or inverse
cosine transform. One difference between the cosine transform and the Fourier
transform is that the cosine transform is real, in the sense that if x is real then
so is its cosine transform $\hat{x}$.²⁵

Both of the above systems have matrix representations. Define the N × N
matrix A whose vth column $A_v$ is given by

$$A_v = C(v) \begin{bmatrix} \cos\frac{v\pi}{2N} \\ \cos\frac{3v\pi}{2N} \\ \vdots \\ \cos\frac{(2N-1)v\pi}{2N} \end{bmatrix}. \tag{10.22}$$

Then (10.20) and (10.21) can be written simply as

$$\hat{x} = A^t x \tag{10.20'}$$
$$x = A \hat{x} \tag{10.21'}$$

Combining (10.20’) and (10.21’) shows that $x = AA^t x$ for each $x \in \mathbb{R}^N$. Thus,
$AA^t = I_{N \times N}$, and hence $A^{-1} = A^t$. A matrix of real numbers with this property
is known as an orthogonal matrix. Its columns (and rows) form a set of
mutually orthogonal unit vectors.
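The orthogonality claim can be checked numerically. The following is a sketch only: it builds A from (10.22) with plain Python lists, and the helper names `cosine_matrix`, `transpose`, and `matmul` are illustrative, not from the text.

```python
import math

def cosine_matrix(n):
    """Return A as a list of rows, with A[k][v] = C(v) * cos((2k+1) v pi / (2n))."""
    def c(v):
        return math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
    return [[c(v) * math.cos((2 * k + 1) * v * math.pi / (2 * n))
             for v in range(n)] for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(p, q):
    qt = transpose(q)
    return [[sum(a * b for a, b in zip(row, col)) for col in qt] for row in p]

A = cosine_matrix(8)
product = matmul(A, transpose(A))

# Largest deviation of A A^t from the identity matrix.
max_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                for i in range(8) for j in range(8))
print(max_error)
```

The deviation should be at the level of floating-point rounding, which is what $AA^t = I_{N \times N}$ predicts.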

From a linear algebra perspective, the Fourier transform process is equiv-
alent to a change of basis: the new basis vectors are just the columns of the
transform’s matrix W (cf. Section 10.2). In similar fashion, the cosine trans-
form is a change in basis, the new basis vectors being the columns of its ma-
trix A. When the vth column $A_v$ of A is extended to a function on $\mathbb{Z}$, i.e.,
$A_v(k) = C(v)\cos[(2k+1)v\pi/2N]$ for $k \in \mathbb{Z}$, then $A_v$ is periodic with period
2N and its frequency, $v/2N$, increases with the (column) index v. This orders $\hat{x}$
compatibly with the frequencies of x, i.e., $\hat{x}(0)$ is the amplitude of $A_0$, the low-
est frequency component of x, $\hat{x}(1)$ is the amplitude of $A_1$, the next-to-lowest
frequency component, and so on with $\hat{x}(N-1)$ the amplitude of $A_{N-1}$, the
highest frequency in x. This ordering is in contrast with the Fourier transform
of x where the middle entries of $\hat{x}$ give information about its high-frequency
components, cf. footnote 14 on page 255.


25 Also, if x has nonzero imaginary part and is extended in this manner, then $\hat{x}$ will also have nonzero
imaginary part. This gives a partial answer to the question posed in Section 10.2.1 concerning the
form a real vector must have to be transformed back to a real vector.

Figure 10.7: The odd extension of a signal to all of k = 0, ..., 2N.


Sine transforms In a similar fashion, sine transforms can be developed by
applying the Fourier transform to an appropriate extension of x. For example, a
sine transform can be obtained by setting $x(N) = 0$ and defining

$$x(N + k) := -x(N - k) \tag{10.23}$$

for $k = 1, \ldots, N$. Figure 10.7 contains a picture of this odd extension of x. The
equations for the corresponding sine transform are

$$\hat{x}(v) = \sqrt{\frac{2}{N+1}} \sum_{k=0}^{N-1} x(k) \sin\frac{(k+1)(v+1)\pi}{N+1}, \quad v = 0, \ldots, N-1, \tag{10.24}$$

$$x(k) = \sqrt{\frac{2}{N+1}} \sum_{v=0}^{N-1} \hat{x}(v) \sin\frac{(k+1)(v+1)\pi}{N+1}, \quad k = 0, \ldots, N-1. \tag{10.25}$$


The N × N sine transform matrix B has its vth column given by the vector

$$B_v = \sqrt{\frac{2}{N+1}} \begin{bmatrix} \sin\frac{\pi(v+1)}{N+1} \\ \sin\frac{2\pi(v+1)}{N+1} \\ \vdots \\ \sin\frac{N\pi(v+1)}{N+1} \end{bmatrix}. \tag{10.26}$$

B is real and symmetric, so the transform equations (10.24) and (10.25) take
the form

$$\hat{x} = Bx \tag{10.24'}$$
$$x = B\hat{x}. \tag{10.25'}$$


These two equations imply that $B^2 = I_{N \times N}$ and, hence, that B is its own
inverse.²⁶ Exercise 4 has more on the sine transform.
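The two stated properties of B (symmetry and $B^2 = I_{N \times N}$) can be confirmed numerically. This sketch assumes the entries of B are those of (10.26); the helper name `sine_matrix` is illustrative.

```python
import math

def sine_matrix(n):
    """B[k][v] = sqrt(2/(n+1)) * sin((k+1)(v+1) pi / (n+1)), as in (10.26)."""
    scale = math.sqrt(2.0 / (n + 1))
    return [[scale * math.sin((k + 1) * (v + 1) * math.pi / (n + 1))
             for v in range(n)] for k in range(n)]

B = sine_matrix(8)
n = len(B)

# B should equal its own transpose.
is_symmetric = all(abs(B[i][j] - B[j][i]) < 1e-12
                   for i in range(n) for j in range(n))

# B times B should be the identity.
B2 = [[sum(B[i][m] * B[m][j] for m in range(n)) for j in range(n)]
      for i in range(n)]
max_error = max(abs(B2[i][j] - (1.0 if i == j else 0.0))
                for i in range(n) for j in range(n))
print(is_symmetric, max_error)
```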

The cosine transform controls high frequencies resulting from endpoint dif- 
ferences, but Figure 10.7 suggests that a sine transform could possibly exacer- 
bate the problem, introducing high-frequencies not only at endpoints but across 


26 If B is the matrix E 4 then $B^2 = I_{2 \times 2}$. Thus, a matrix B with $B^2 = I_{N \times N}$ can be quite
different from the identity.
As an example we compute cosine and sine transforms of the sample vector
x from last section and compare them with each other and its Fourier transform
$\hat{x}$. For convenience, x and the magnitudes (amplitudes) of its Fourier transform
are reproduced here:

$$x = (0.0,\ 2.5,\ 5.0,\ 12.5,\ 20.0,\ 20.0,\ 27.5,\ 35.0)^t, \qquad \mathrm{Abs}(\hat{x}) = (43.31,\ 18.88,\ 9.88,\ 8.57,\ 6.18,\ 8.57,\ 9.88,\ 18.88)^t.$$

Using Cx and Sx to denote the cosine and sine transform of x, respectively, with
$\hat{x}$ reserved for its Fourier transform, then (10.20) and (10.24) imply

$$Cx = (43.31,\ -32.46,\ 2.11,\ -2.67,\ 4.42,\ -2.04,\ -1.83,\ 0.97)^t \quad \text{and} \quad Sx = (40.03,\ -29.54,\ 13.27,\ -11.88,\ 11.05,\ -7.14,\ 1.64,\ -0.71)^t.$$

Entries in Sx, like those of the cosine transform, are ordered compatibly with increas-
ing frequencies of x. Thus, in this one case anyway, of the three transforms, the
cosine transform seems to be the winner if the race is to represent x with small
high-frequency terms.
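The entries of Cx and Sx can be reproduced directly from (10.20) and (10.24). The sketch below is a plain transcription of those sums, not an efficient implementation; the function names are illustrative.

```python
import math

x = [0.0, 2.5, 5.0, 12.5, 20.0, 20.0, 27.5, 35.0]

def cosine_transform(x):
    """Forward cosine transform, equation (10.20)."""
    n = len(x)
    def c(v):
        return math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
    return [sum(x[k] * c(v) * math.cos((2 * k + 1) * v * math.pi / (2 * n))
                for k in range(n)) for v in range(n)]

def sine_transform(x):
    """Forward sine transform, equation (10.24)."""
    n = len(x)
    scale = math.sqrt(2.0 / (n + 1))
    return [scale * sum(x[k] * math.sin((k + 1) * (v + 1) * math.pi / (n + 1))
                        for k in range(n)) for v in range(n)]

Cx = cosine_transform(x)
Sx = sine_transform(x)
print([round(c, 2) for c in Cx])
print([round(s, 2) for s in Sx])
```

Comparing the magnitudes of the last few entries of each list gives a rough sense of how much high-frequency content each transform assigns to x.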


10.3.1 A general orthogonal transform 


The Fourier, cosine, and sine transforms are all examples of orthogonal trans-
formations. Each could have been developed starting with an appropriate or-
thonormal basis for $\mathbb{C}^N$, i.e., a set of vectors $\{e_0, e_1, \ldots, e_{N-1}\} \subset \mathbb{C}^N$ with

$$\langle e_u, e_v \rangle = \begin{cases} 1, & \text{if } u = v, \\ 0, & \text{if } u \neq v. \end{cases}$$

Here $\langle e_u, e_v \rangle$ denotes the usual inner product of $e_u$ and $e_v$ in $\mathbb{C}^N$,

$$\langle e_u, e_v \rangle = \sum_{k=0}^{N-1} e_u(k)\, \overline{e_v(k)}. \tag{10.27}$$

For example, the Fourier transform on $\mathbb{C}^N$ has basis vectors $\{W_0, \ldots, W_{N-1}\}$
from Section 10.2.
If $\{e_0, \ldots, e_{N-1}\}$ is a basis for $\mathbb{C}^N$, then for each vector $v \in \mathbb{C}^N$ there is a
unique vector $\hat{v} \in \mathbb{C}^N$ such that

$$v = \sum_{j=0}^{N-1} \hat{v}(j)\, e_j. \tag{10.28}$$

We call $\hat{v}$ the transform of v with respect to the basis $\{e_0, e_1, \ldots, e_{N-1}\}$, and
computing $\hat{v}$ requires solving the above linear system.

Define an N × N matrix E by letting its kth column be the (column) vector
$e_k$. Thus, $E = [e_0\ e_1\ \cdots\ e_{N-1}]$ and $v = E\hat{v}$. Suppose now, in addition to being
a basis, the set $\{e_0, \ldots, e_{N-1}\}$ is also orthonormal. Then the columns of E are
orthonormal, i.e., $\bar{E}^t E = I_{N \times N}$, hence $E^{-1} = \bar{E}^t$ and the relationship

$$\hat{v} = \bar{E}^t v \tag{10.29}$$
$$v = E\hat{v} \tag{10.30}$$

holds for all $v \in \mathbb{C}^N$.²⁷ To obtain the Fourier, cosine, and sine transforms from
these general transform equations, take the basis vectors $e_k$ to be $W_k$, $A_k$, and
$B_k$ respectively; see (10.9), (10.22), and (10.26).
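A sketch of the general transform pair (10.29), (10.30) for a complex orthonormal basis. It assumes the Fourier basis vectors have entries $W_j(k) = e^{2\pi i jk/N}/\sqrt{N}$, as written out in Section 10.4.1; the function names are hypothetical.

```python
import cmath
import math

def fourier_basis(n):
    """Basis vectors W_j with W_j(k) = exp(2 pi i j k / n) / sqrt(n)."""
    return [[cmath.exp(2j * cmath.pi * j * k / n) / math.sqrt(n)
             for k in range(n)] for j in range(n)]

def transform(v, basis):
    """v_hat(j) = <v, e_j> = sum_k v(k) * conj(e_j(k)); this is (10.29)."""
    return [sum(vk * ek.conjugate() for vk, ek in zip(v, e)) for e in basis]

def inverse(v_hat, basis):
    """v(k) = sum_j v_hat(j) * e_j(k); this is (10.28), equivalently (10.30)."""
    n = len(basis[0])
    return [sum(v_hat[j] * basis[j][k] for j in range(len(basis)))
            for k in range(n)]

v = [0.0, 2.5, 5.0, 12.5, 20.0, 20.0, 27.5, 35.0]
basis = fourier_basis(len(v))
v_hat = transform(v, basis)
v_back = inverse(v_hat, basis)

# Round-trip error should be at rounding level.
max_error = max(abs(a - b) for a, b in zip(v, v_back))
print(max_error)
```

Because the basis is orthonormal, no linear system has to be solved: each coefficient is a single inner product.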


10.3.2 Summary 


At this point in the chapter our signals $x \in \mathbb{R}^N$ have been one-dimensional; the
kth component x(k) of x is the value of whatever it is we’re recording at time k
(a good example might be a simple audio signal sampled N times). But x can
be just about any ordered list of data; for instance, it could represent a sequence
of N daily observations of the snow depth at the Alta ski area in the Wasatch
mountains near Salt Lake City, Utah. The transforms that have been discussed
are not concerned with the physical nature of the vector x.

The Fourier, cosine, and sine transforms each take as input a vector x and
output a vector $\hat{x}$ whose components $\hat{x}(k)$ contain information about the funda-
mental frequency make-up of x. The use of any of these classical transforms
can be thought of as a decomposition process, i.e., a process of breaking up a
signal into fundamental frequency “pieces,” and each of them holds a steady
place in the general study, description, and further analysis of signals.

We’ve also indicated how the information in $\hat{x}$ can help to define a “com-
pression” of x, provided we are willing to allow some error into its reconstruc-
tion. In the next section our goal is to extend these one-dimensional tools to
two-dimensional signals, i.e., to images.


Exercises 10.3 


1. Compute cosine and sine transforms, Cy and Sy, of the sample vector y of
Section 10.2.1 and compare results with each other and $\hat{y}$.


27 A matrix E satisfying $\bar{E}^t = E^{-1}$ is called a unitary matrix. If E is also real then it is called an
orthogonal matrix.

2. If E is an N × N matrix whose columns are orthonormal, then its rows are,
too. Hint: $\bar{E}^t E = I_{N \times N}$ implies that $\bar{E}^t = E^{-1}$.


3. This exercise develops the cosine transform from the Fourier transform.
If $x = (x_0, \ldots, x_{N-1}) \in \mathbb{R}^N$ then we’ve seen that x and its Fourier transform
$\hat{x}$ are related by

$$\hat{x}(v) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} x(k)\, e^{-2\pi i k v/N} \tag{10.31}$$

$$x(k) = \frac{1}{\sqrt{N}} \sum_{v=0}^{N-1} \hat{x}(v)\, e^{2\pi i k v/N}. \tag{10.32}$$

Extend x to a larger interval $k = 0, \ldots, 2N-1$ by defining

$$x(N + k) = x(N - (k + 1)) \tag{10.33}$$

and think of x now as a vector in $\mathbb{R}^{2N}$. Figure 10.6 is helpful here.


(a) Transform x (remember, it’s now in $\mathbb{R}^{2N}$) and use (10.33) to show that

$$\hat{x}(v) = \sqrt{\frac{2}{N}}\, e^{\pi i v/2N} \sum_{k=0}^{N-1} x(k) \cos\frac{(2k+1)v\pi}{2N}.$$

Hint: Apply (10.31) to obtain $\hat{x}(v) = (1/\sqrt{2N}) \sum_{k=0}^{2N-1} x(k)\, e^{-\pi i k v/N}$.
Split the sum into $\sum_{k=0}^{N-1} + \sum_{k=N}^{2N-1}$ and, in the second sum,
use the extension definition $x(N + k) = x(N - (k + 1))$ together with
the fact that $\cos\theta = (e^{i\theta} + e^{-i\theta})/2$.

Remark: The right-hand side of this equation is defined for any integer
v and has period 2N, providing an extension of $\hat{x}$ from $v = 0, \ldots, 2N-1$
to all of $\mathbb{Z}$ with the property $\hat{x}(v + 2N) = \hat{x}(v)$.


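(b) Define $y(v) = e^{-\pi i v/2N}\, \hat{x}(v)$, for $v \in \mathbb{Z}$, so that the result in part (a) can
be written as

$$y(v) = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} x(k) \cos\frac{(2k+1)v\pi}{2N}, \quad v = 0, \ldots, N-1. \tag{10.34}$$

Now show that

(i) when x is real, so is y,
(ii) $y(N) = 0$,
(iii) and, for all $v \in \mathbb{Z}$, $y(-v) = y(v)$ and $y(v + 2N) = -y(v)$.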


(c) Argue that x can be recovered from y from the relation

$$x(k) = \frac{1}{\sqrt{2N}}\, y(0) + \sqrt{\frac{2}{N}} \sum_{v=1}^{N-1} y(v) \cos\frac{(2k+1)v\pi}{2N}. \tag{10.35}$$

Hint: From (10.32),

$$x(k) = \frac{1}{\sqrt{2N}} \sum_{v=0}^{2N-1} \hat{x}(v)\, e^{2\pi i k v/2N} = \frac{1}{\sqrt{2N}} \sum_{v=0}^{2N-1} y(v)\, e^{\pi i v/2N}\, e^{\pi i k v/N}$$

so proceed with an argument similar in spirit to that of part (a).


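(d) By defining $\hat{y}(0) = y(0)/\sqrt{2}$, $C(0) = \sqrt{1/N}$ and $\hat{y}(v) = y(v)$, $C(v) =
\sqrt{2/N}$ for $v = 1, \ldots, N-1$, rewrite (10.34) and (10.35) as

$$\hat{y}(v) = \sum_{k=0}^{N-1} x(k)\, C(v) \cos\frac{(2k+1)v\pi}{2N}$$

$$x(k) = \sum_{v=0}^{N-1} \hat{y}(v)\, C(v) \cos\frac{(2k+1)v\pi}{2N}.$$

Relabeling $\hat{y}$ with $\hat{x}$ gives the symmetric discrete cosine transform
(10.20), (10.21).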


4. Consider the sine transform on $\mathbb{C}^N$ defined by (10.24) and (10.25).

(a) The right-hand sides of these two equations are defined for all integers
in $\mathbb{Z}$. Show that these extensions of x and $\hat{x}$ have period 2N + 2.

(b) Show that $x(N) = 0$ and that $x(2N + 1) = 0$.

(c) Show that $x(N + k) = -x(N - k)$, for $k = 0, \ldots, N$ (see Figure 10.7).


5. Show the general orthogonal transform defined in Section 10.3.1 is an
isometry on $\mathbb{C}^N$, i.e., if $\hat{v}$ is the (orthogonal) transform of v then $\|\hat{v}\| = \|v\|$.
This shows, at one stroke, that the Fourier, cosine, and sine transforms are
all isometries on either $\mathbb{C}^N$ or $\mathbb{R}^N$.


10.4 Two-dimensional transforms 


The JPEG image compression scheme employs a 2D cosine transform as part of 
its specification. In this section we develop the 2D cosine transform as a special 
case of a more general type of 2D transform. Section 10.5 discusses the role 
this cosine transform plays in the JPEG compression scheme. 

The Fourier, cosine, and sine transforms are, at this stage, one-dimensional 
orthogonal transforms. However, each can easily be extended to a two dimen- 
sional transform. A separate 2D extension process could be argued for each one, 
but it’s more efficient (and insightful) to do the argument only once, obtaining 
a general two-dimensional orthogonal transform. The 2D Fourier, cosine, and
sine transforms will then follow as special cases. 
There are three ways we’ve looked at transforms, all of them equivalent: 


1. as a list of equations giving explicit instructions on how to compute each 
component of the transformed vector, e.g., (10.7) and (10.8), 

2. as an operator or matrix expression, e.g., (10.7’) and (10.8’), or 

3. as a change of basis, e.g., (10.7”) and (10.8”). 


The last approach emphasizes a basis choice and defines the path we’ll follow 
here. 

Basis elements for 2D orthogonal transforms can be constructed from the
basis vectors of orthogonal one-dimensional transforms. The procedure starts
with a basis of mutually orthonormal N-vectors, $\{e_0, e_1, \ldots, e_{N-1}\} \subset \mathbb{C}^N$. Thus,

$$\langle e_u, e_v \rangle = \begin{cases} 1, & \text{if } u = v, \\ 0, & \text{if } u \neq v, \end{cases}$$

where the inner product $\langle \cdot, \cdot \rangle$ is defined in Subsection 10.3.1. For each pair of
indices u and v from $\{0, \ldots, N-1\}$ we can define an N × N array $f_{uv}$ whose
entry in the jth row and kth column is given by

$$f_{uv}(j, k) = e_u(j)\, e_v(k), \quad 0 \le j, k \le N-1. \tag{10.36}$$


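There are $N^2$ of these matrices and they turn out to be mutually orthogonal
when regarded as members of $\mathbb{C}^{N^2}$, i.e.,

$$\begin{aligned}
\langle f_{uv}, f_{u'v'} \rangle &= \sum_{j,k=0}^{N-1} f_{uv}(j,k)\, \overline{f_{u'v'}(j,k)} \\
&= \sum_{j,k=0}^{N-1} e_u(j)\, e_v(k)\, \overline{e_{u'}(j)\, e_{v'}(k)} \\
&= \sum_{j=0}^{N-1} e_u(j)\, \overline{e_{u'}(j)} \sum_{k=0}^{N-1} e_v(k)\, \overline{e_{v'}(k)} \\
&= \langle e_u, e_{u'} \rangle \langle e_v, e_{v'} \rangle = \begin{cases} 1, & \text{if } u = u' \text{ and } v = v', \\ 0, & \text{otherwise.} \end{cases}
\end{aligned}$$

This computation also shows that each $f_{uv}$ has unit length, i.e., that $\|f_{uv}\| = 1$.

The collection $\{f_{uv} \mid 0 \le u, v \le N-1\}$ of N × N arrays forms an orthonormal
subset of $\mathbb{C}^{N^2}$ and we’ll use it as a basis for the space of all N × N arrays of
complex numbers.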

Now let f be an N × N matrix (think of f as a 2D signal, an image). The
transform of f, corresponding to the $N^2$ basis matrices $\{f_{uv} \mid 0 \le u, v \le N-1\}$,
is another N × N matrix $\hat{f}$ with entries $\hat{f}(u, v)$ defined by the equation

$$f = \sum_{u,v=0}^{N-1} \hat{f}(u, v)\, f_{uv}. \tag{10.37}$$

We’ve seen this kind of equation before (cf. (10.28)). It uniquely defines $\hat{f}$
but it doesn’t explicitly tell us how to compute the entry in its uth row and vth
column, i.e., the entry $\hat{f}(u, v)$. This is again a linear algebra problem (with $N^2$
unknowns $\hat{f}(u, v)$, $0 \le u, v \le N-1$). To solve it we’ll again exploit orthonor-
mality of the construction material, i.e., of the basis matrices $\{f_{uv}\}$.

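Pick a row index $u_0$ and a column index $v_0$. To compute the entry $\hat{f}(u_0, v_0)$
in the array $\hat{f}$ proceed in the usual way by taking the ($\mathbb{C}^{N^2}$) inner product of
both sides of (10.37) with $f_{u_0 v_0}$:

$$\langle f, f_{u_0 v_0} \rangle = \Big\langle \sum_{u,v=0}^{N-1} \hat{f}(u, v)\, f_{uv},\ f_{u_0 v_0} \Big\rangle = \sum_{u,v=0}^{N-1} \hat{f}(u, v)\, \langle f_{uv}, f_{u_0 v_0} \rangle = \hat{f}(u_0, v_0).$$

Thus $\hat{f}(u_0, v_0) = \langle f, f_{u_0 v_0} \rangle$, i.e., the transform coefficient $\hat{f}(u_0, v_0)$ is just the
inner product of f with the basis array $f_{u_0 v_0}$. Expanding this inner product
results in the transform formula

$$\hat{f}(u_0, v_0) = \sum_{j,k=0}^{N-1} f(j, k)\, \overline{f_{u_0 v_0}(j, k)}.$$

Since $u_0$ and $v_0$ are arbitrary row and column indices, then, together with the
system (10.37), we have a general transform pair

$$\hat{f}(u, v) = \sum_{j,k=0}^{N-1} f(j, k)\, \overline{f_{uv}(j, k)}, \quad 0 \le u, v \le N-1, \tag{10.38}$$

$$f(j, k) = \sum_{u,v=0}^{N-1} \hat{f}(u, v)\, f_{uv}(j, k), \quad 0 \le j, k \le N-1. \tag{10.39}$$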
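The transform pair (10.38), (10.39) can be transcribed almost literally. The sketch below uses the cosine basis vectors of Section 10.3 as the orthonormal set $\{e_v\}$ and a small made-up 4 × 4 array as f; both choices, and all function names, are assumptions made for illustration.

```python
import math

def cosine_basis(n):
    """Basis vectors e_v with e_v(k) = C(v) cos((2k+1) v pi / (2n))."""
    def c(v):
        return math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
    return [[c(v) * math.cos((2 * k + 1) * v * math.pi / (2 * n))
             for k in range(n)] for v in range(n)]

def forward_2d(f, e):
    """(10.38): f_hat(u,v) = sum_{j,k} f(j,k) * conj(e_u(j) e_v(k))."""
    n = len(e)
    return [[sum(f[j][k] * (e[u][j] * e[v][k]).conjugate()
                 for j in range(n) for k in range(n))
             for v in range(n)] for u in range(n)]

def inverse_2d(f_hat, e):
    """(10.39): f(j,k) = sum_{u,v} f_hat(u,v) * e_u(j) e_v(k)."""
    n = len(e)
    return [[sum(f_hat[u][v] * e[u][j] * e[v][k]
                 for u in range(n) for v in range(n))
             for k in range(n)] for j in range(n)]

# A small synthetic "image" (hypothetical data).
f = [[(3 * j + 5 * k) % 7 for k in range(4)] for j in range(4)]
e = cosine_basis(4)
f_hat = forward_2d(f, e)
f_back = inverse_2d(f_hat, e)

max_error = max(abs(f[j][k] - f_back[j][k])
                for j in range(4) for k in range(4))
print(max_error)
```

The conjugate has no effect for a real basis such as this one; it is kept so the same code would apply to a complex basis.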


10.4.1 The 2D Fourier, cosine, and sine transforms 


In this section, we apply (10.38) and (10.39) to the development of two-dimen- 
sional versions of the Fourier, cosine, and sine transforms. 


The 2D Fourier transform Recall, from the latter part of Section 10.2, that the
columns of the Fourier transform matrix W form the basis vectors
corresponding to the one-dimensional Fourier transform. The kth column is

$$W_k = \frac{1}{\sqrt{N}} \begin{bmatrix} 1 \\ e^{2\pi i k/N} \\ e^{2\pi i 2k/N} \\ \vdots \\ e^{2\pi i (N-1)k/N} \end{bmatrix}.$$

Using (10.36), with $W_k$ in place of $e_k$, we can construct the basis matrices $f_{uv}$
for the 2D Fourier transform: the (j, k) entry in $f_{uv}$ is given by

$$f_{uv}(j, k) = e_u(j)\, e_v(k) = W_u(j)\, W_v(k) = \frac{1}{\sqrt{N}}\, e^{2\pi i u j/N} \cdot \frac{1}{\sqrt{N}}\, e^{2\pi i v k/N}.$$

It follows that $f_{uv}(j, k) = (1/N)\, e^{2\pi i (uj + vk)/N}$ and from (10.38) and (10.39),
the two-dimensional Fourier transform takes the form

$$\hat{f}(u, v) = \frac{1}{N} \sum_{j,k=0}^{N-1} f(j, k)\, e^{-2\pi i (uj + vk)/N} \tag{10.40}$$

$$f(j, k) = \frac{1}{N} \sum_{u,v=0}^{N-1} \hat{f}(u, v)\, e^{2\pi i (uj + vk)/N}. \tag{10.41}$$


We’ve emphasized before how the one-dimensional Fourier transform is really
a map between periodic sequences. The exponential functions in the right-hand
sides of equations (10.40) and (10.41) give us a 2D analog: they are defined for
any pairs of integers u and v or j and k, and, as such, extend definitions of both
$\hat{f}$ and f to all of $\mathbb{Z} \times \mathbb{Z} = \mathbb{Z}^2$. Moreover, the extension is periodic with period
N in both directions.

In the same manner then that the one-dimensional Fourier transform sees
signals of length N as single periods of periodic signals of period N, the 2D
Fourier transform sees both $\hat{f}$ and f as maps from $\mathbb{Z}^2$ to $\mathbb{C}$ with the property that
$\hat{f}(u + N, v) = \hat{f}(u, v + N) = \hat{f}(u, v)$ and $f(j + N, k) = f(j, k + N) = f(j, k)$
for any (u, v) or (j, k) in $\mathbb{Z}^2$.


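The 2D cosine transform From Section 10.3, the vth basis vector $e_v$ corre-
sponding to the 2D cosine transform is the vth column of the cosine transform
matrix A. Thus in (10.36), $e_k = A_k$ where

$$A_k = C(k) \begin{bmatrix} \cos\frac{k\pi}{2N} \\ \cos\frac{3k\pi}{2N} \\ \vdots \\ \cos\frac{(2N-1)k\pi}{2N} \end{bmatrix}$$

for $0 \le k \le N-1$ with $C(0) = 1/\sqrt{N}$ and $C(k) = \sqrt{2/N}$ if $k \neq 0$. The (j, k)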


© 2003 by CRC Press LLC 




entry in the basis matrix for the 2D cosine transform then looks like

$$f_{uv}(j, k) = A_u(j)\, A_v(k) = C(u) \cos\frac{(2j+1)u\pi}{2N}\; C(v) \cos\frac{(2k+1)v\pi}{2N}.$$

Equations (10.38) and (10.39) then give us the two-dimensional cosine trans-
form pair

$$\hat{f}(u, v) = \sum_{j,k=0}^{N-1} f(j, k)\, C(u) \cos\frac{(2j+1)u\pi}{2N}\; C(v) \cos\frac{(2k+1)v\pi}{2N} \tag{10.42}$$

$$f(j, k) = \sum_{u,v=0}^{N-1} \hat{f}(u, v)\, C(u) \cos\frac{(2j+1)u\pi}{2N}\; C(v) \cos\frac{(2k+1)v\pi}{2N}. \tag{10.43}$$


Once again we emphasize that these equations extend $\hat{f}$ and f to two-dimen-
sional periodic signals (now with period 2N) defined on $\mathbb{Z}^2$. This is the global
signal the cosine transform “sees” when given either $\hat{f}$ or f.

The extension provided by (10.43) is also even, in the sense that it extends
f to an even function in both horizontal and vertical directions. This extended
f is generally smoother than that provided by the 2D Fourier transform or sine
transform. To visualize this extension, picture f as an array and then reflect it
across each of its four boundaries. This creates four more arrays, each having
the same dimensions. Now reflect each one of these “new” arrays across their
boundaries and so on. Continuing this tiling process throughout $\mathbb{Z}^2$ yields the
signal that the 2D cosine transform regards as f.

To gain further insight into the 2D cosine transform, rewrite (10.43) as

$$f = \sum_{u,v=0}^{N-1} \hat{f}(u, v)\, B_{uv} \tag{10.44}$$

where $B_{uv}$ is the N × N basis element whose entry in the jth row and kth column
is

$$B_{uv}(j, k) = C(u) \cos\frac{(2j+1)u\pi}{2N}\; C(v) \cos\frac{(2k+1)v\pi}{2N}.$$

The collection $\{B_{uv} \mid 0 \le u, v \le N-1\}$ contains the building blocks used to
construct f. Each $B_{uv}$ can be regarded as a basic image element.

A greyscale image can be represented as a table of grey-levels, i.e., an array
of integers. Conversely, an array of integers can be thought of as a table of grey-
levels. Figure 10.8(a) shows the 16 basic image elements $B_{uv}$ arranged in an
array indexed by u and v, corresponding to a 4 × 4 transform. Any given 4 × 4
image can be written uniquely as a linear combination of these basic images.
The (transform) coefficient $\hat{f}(u, v)$ is a measure of the “presence” of $B_{uv}$ in the
overall image. Figure 10.8(b) is an illustration of equation (10.44), showing, in
this case, a randomly generated 4 × 4 image being built up in stages by linear
combinations of basic images.²⁸ The partial sums of (10.44) are formed by

28 Figure A.1 in Appendix A shows a similar example with an 8 × 8 image.
Figure 10.8: Image elements for the 2D cosine transform (N = 4), sample
image, and the 16 partial sums. Panels: (a) 4 × 4 basic images; (b) sample image and the 16 partial sums.

Figure 10.9: Partial sums build up to the original image. Panel label: Original.


following a zigzag sequence through the basis images; Figure 10.12 indicates
this type of ordering for the 8 × 8 case.

In another example, Figure 10.9 shows how the cosine transform might be 
applied to an image with 32 x 32 pixels. In this case there are 32 x 32 = 1024 
basic image elements (the $B_{uv}$’s) so it’s impractical to display them all in an array
setting like Figure 10.8(a). We can still “plot” some of the partial sums though. 
The original image is on the right and is a representation of the complete sum 
in (10.44). The three images to its left each correspond to a certain fraction of 
this sum; that is, they are partial sums. The leftmost image contains the first 
1/4, or 256, terms of the sum, the next image contains an additional 256 (the 
halfway stage), and the next adds 256 more. Scanning from left to right in the 
figure illustrates how the image sharpens as more and more basic image terms 
get added to this “running” sum. 


The 2D sine transform A two-dimensional sine transform is left as Exer- 
cise 10.4.3. 
10.4.2 Matrix expressions for 2D transforms


Not surprisingly, the general 2D transform equations (10.38), (10.39) have ma-
trix forms:

$$\hat{f} = \bar{E}^t f \bar{E} \tag{10.38'}$$
$$f = E \hat{f} E^t. \tag{10.39'}$$

E has for columns the orthonormal vectors $e_k$, $k = 0, 1, \ldots, N-1$. Any such
matrix E, whose columns form an orthonormal set of vectors, is called a unitary
matrix (orthogonal, if E is real) and has the property that $\bar{E}^t E = E \bar{E}^t = I_{N \times N}$,
or equivalently, $E^{-1} = \bar{E}^t$.

The two-dimensional versions of the Fourier, cosine, and sine transforms 
have matrix representations of this kind and are quickly obtained from (10.38’) 
and (10.39’). Before we do this, we need to check that (10.38’) and (10.39’) 
are correct (we haven’t done this yet!). This turns out to be, for the most part, 
bookkeeping, but we feel it’s instructive. To proceed we’ll start with the basic 
transform (10.38) 


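$$\hat{f}(u, v) = \sum_{j,k=0}^{N-1} f(j, k)\, \overline{f_{uv}(j, k)} = \sum_{j,k=0}^{N-1} f(j, k)\, \overline{e_u(j)}\, \overline{e_v(k)} = \sum_{j=0}^{N-1} \overline{e_u(j)} \sum_{k=0}^{N-1} f(j, k)\, \overline{e_v(k)}. \tag{$*$}$$

The inner sum of the above expression can be interpreted as the entry in a matrix
product:

$$\sum_{k=0}^{N-1} f(j, k)\, \overline{e_v(k)} = [\,j\text{th row of } f\,]\,[\,v\text{th column of } \bar{E}\,] = [f\bar{E}](j, v)$$

where $[f\bar{E}](j, v)$ denotes the (j, v) entry of the (matrix) product $f\bar{E}$. Inserting
this expression back into ($*$) gives

$$\begin{aligned}
\hat{f}(u, v) &= \sum_{j=0}^{N-1} \overline{e_u(j)}\, [f\bar{E}](j, v) \\
&= [\,u\text{th column of } \bar{E}\,]^t\, [\,v\text{th column of } f\bar{E}\,] \\
&= [\,u\text{th row of } \bar{E}^t\,]\, [\,v\text{th column of } f\bar{E}\,] = [\bar{E}^t f \bar{E}](u, v).
\end{aligned}$$

Thus, $\hat{f} = \bar{E}^t f \bar{E}$, for any N × N matrix f. This is, of course, equation (10.38’)
and, since E is unitary, it easily inverts to equation (10.39’).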

The matrix equations (10.38’), (10.39’) provide clean descriptions of the
2D Fourier, cosine, and sine transforms and are found below. However, im-
plementations of these transforms are usually coded in a more efficient manner
(cf. the FFT in Exercise 10.2.12). Nevertheless, the matrix forms above do
make it easy to experiment with images, especially if using a matrix-oriented
mathematical software package such as MATLAB or Octave.
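As a sketch of such an experiment, here is the matrix form of the 2D cosine transform, $\hat{f} = A^t f A$ with inverse $f = A \hat{f} A^t$, written in plain Python rather than MATLAB or Octave. The 4 × 4 test array and the helper names (`cosine_matrix`, `dct2`, `idct2`) are made up for illustration.

```python
import math

def cosine_matrix(n):
    """A[k][v] = C(v) cos((2k+1) v pi / (2n)); column v is A_v of (10.22)."""
    def c(v):
        return math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
    return [[c(v) * math.cos((2 * k + 1) * v * math.pi / (2 * n))
             for v in range(n)] for k in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(p, q):
    qt = transpose(q)
    return [[sum(a * b for a, b in zip(row, col)) for col in qt] for row in p]

def dct2(f):
    """Forward 2D cosine transform in matrix form: A^t f A."""
    a = cosine_matrix(len(f))
    return matmul(matmul(transpose(a), f), a)

def idct2(f_hat):
    """Inverse 2D cosine transform in matrix form: A f_hat A^t."""
    a = cosine_matrix(len(f_hat))
    return matmul(matmul(a, f_hat), transpose(a))

# Hypothetical 4 x 4 "image".
f = [[(3 * j + 5 * k) % 7 for k in range(4)] for j in range(4)]
f_hat = dct2(f)
f_back = idct2(f_hat)

max_error = max(abs(f[j][k] - f_back[j][k])
                for j in range(4) for k in range(4))
print(max_error)
```

Since $C(0)^2 = 1/N$, the coefficient $\hat{f}(0, 0)$ equals the sum of all entries of f divided by N, which gives a quick sanity check on the output.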


2D Fourier transform The one-dimensional Fourier transform matrix W in
(10.8’) is symmetric, i.e., has the property that $W^t = W$. From (10.38’) and
(10.39’), with E = W, we see that the 2D Fourier transform can be written as

$$\hat{f} = \bar{W} f \bar{W} \tag{10.40'}$$
$$f = W \hat{f} W. \tag{10.41'}$$

2D cosine transform The cosine transform matrix A is real (see Section 10.3),
hence $\bar{A} = A$. The matrix form of the 2D cosine transform is then

$$\hat{f} = A^t f A \tag{10.42'}$$
$$f = A \hat{f} A^t. \tag{10.43'}$$

2D sine transform The sine transform is even simpler. The matrix B corre-
sponding to the sine transform (Section 10.3 again) is real and symmetric, i.e.,
$\bar{B} = B$ and $B^t = B$. Thus, the 2D sine transform is

$$\hat{f} = B f B \tag{10.45}$$
$$f = B \hat{f} B. \tag{10.46}$$


Exercises 10.4 


1. Show that the process of cosine transforming an N × N image f is equiv-
alent to first taking the 1D cosine transform of each column followed by the
1D cosine transform of the resulting rows. Hint: $\hat{f} = A^t f A = (A^t (A^t f)^t)^t$.


Does a similar process describe the 2D Fourier transform? The general 2D 
transform (10.38’)? 

2. In Section 10.4.1 the even 2D signal f seen by the cosine transform was 
described in a visual or geometric fashion. Describe the extension of f 
provided by the 2D Fourier transform. 

3. Show that the two-dimensional sine transform equations have the form

$$\hat{f}(u, v) = \frac{2}{N+1} \sum_{j,k=0}^{N-1} f(j, k) \sin\frac{\pi(j+1)(u+1)}{N+1} \sin\frac{\pi(k+1)(v+1)}{N+1}$$

$$f(j, k) = \frac{2}{N+1} \sum_{u,v=0}^{N-1} \hat{f}(u, v) \sin\frac{\pi(j+1)(u+1)}{N+1} \sin\frac{\pi(k+1)(v+1)}{N+1}.$$


10.5 An application: JPEG image compression 


The compression methods discussed in Chapters 5–9 can be used on image data.
In fact, the popular GIF format uses an LZW scheme to compress 256-color 
images. Portable Network Graphics (PNG) is more sophisticated and capable, 
using a predictor (or filter) to prepare the data for a gzip-style compressor. How- 
ever, applications using high resolution images with thousands of colors may 
require more compression than can be achieved with these lossless methods. 

Lossy schemes discard some of the data in order to obtain better compres- 
sion. The problem, of course, is deciding just what information is to be com- 
promised. Loss of information in compressing text is typically unacceptable, 
although simple schemes such as elimination of every vowel from English text 
may find application somewhere. The situation is different with images and 
sound: some loss of data may be quite acceptable, even imperceptible. 

In the 1980s, the Joint Photographic Experts Group (JPEG) was formed to 
develop standards for still-image compression. The specification includes both 
lossless and lossy modes, although the latter is perhaps of the most interest (and 
is usually what is meant by “JPEG compression”). This section will consider
only the ideas of the lossy mode, applied to greyscale images.²⁹

The method in lossy JPEG depends for its compression on an important 
mathematical and physical theme: local approximation. Both mathematical and 
physical objects are often easier to understand and examine when analyzed lo- 
cally. The JPEG group took this idea and fine-tuned it with results gained from 
studies on the human visual system. The resulting scheme enjoys wide use, in 
part because it is an open standard, but mostly because it does well on a large 
class of images, with fairly modest resource requirements. 

The ideas can be illustrated with a greyscale image; that is, a matrix of 
integer values representing levels of grey. The range of values isn’t important in 
understanding the mathematical ideas, although it is common to restrict values 
to the interval [0,255], giving a total of 256 levels of grey. The ‘bird’ at left 
in Figure 10.10 shows an image containing 256 x 256 pixels with 145 shades of 
grey represented. 

Portions of this image appear to contain relatively constant levels of grey.
Working locally, we could collapse these almost-constant regions to their aver-
age shade of grey. Aesthetic questions aside for now, suppose we do this, that
is, partition the 256 × 256 ‘bird’ image into 1024 blocks of 8 × 8 pixels and replace each
of the 8 × 8 pixel blocks with its average shade of grey. The resulting image ap-
pears on the right in Figure 10.10. The original 256 × 256 array of numbers has
been reduced to a 32 × 32 array, or to 1/64 of its original size ($64 = 256^2/32^2$).
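Block averaging of this kind takes only a few lines. The sketch below uses a synthetic 256 × 256 array in place of the ‘bird’ image, which is not available here; the function name `block_average` and the test pattern are assumptions for illustration.

```python
def block_average(img, b=8):
    """Replace each b x b block of img by its mean grey level.

    Returns an array with one entry per block, so a 256 x 256 input
    with b = 8 yields a 32 x 32 output.
    """
    rows, cols = len(img), len(img[0])
    return [[sum(img[r + i][c + j] for i in range(b) for j in range(b)) / (b * b)
             for c in range(0, cols, b)]
            for r in range(0, rows, b)]

# Synthetic stand-in for a 256 x 256 greyscale image with values in [0, 255].
img = [[(r * c) % 256 for c in range(256)] for r in range(256)]
small = block_average(img)
print(len(small), len(small[0]))
```

To view the result at the original size, each averaged value would be repeated over its 8 × 8 block, which is what produces the blocky appearance described in the text.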

29 See Section 10.7 for some remarks on color.

Figure 10.10: Block-averaging applied to ‘bird’.

On certain (mostly uninteresting) portions of the image this simple method
works quite well but, of course, considerable detail has been lost in several key
areas. This idea of working locally, though, does seem to have merit. However,
as Figure 10.10 so clearly shows, at the very least it needs considerable refine-
ment before it can be thought of as a viable method. We could refine the block
size, i.e., go to a smaller than 8 × 8 block, but doing so could sacrifice compres-
sion; in fact, it would perhaps be better to use the largest block size we could get
away with.³⁰ It’s tempting to imagine some sort of adaptive method that uses
large blocks when possible but goes to smaller blocks in image areas of high
detail,³¹ perhaps carving the picture up into odd shapes (like a jigsaw puzzle)
as the method progresses through the image. However, unless done elegantly,
this complication could add considerable baggage to the information required
for image reconstruction and thus prove self-defeating. Instead of block size
modification, JPEG simply chooses to preserve more detail in an 8 × 8 block
whenever it determines detail is too important to throw away.

The “detail detector” built into JPEG is the 2D cosine transform of Sec- 
tion 10.4. The cosine transform (or, for that matter, any Fourier transform) 
exchanges raw image (spatial) information directly for information about fre- 
quency content. An 8 x8 block is built up with basic 8 x 8 cosine block images 
of increasing detail. There are 64 of these image elements, each of which is 
displayed in Figure A.1 of Appendix A. 

Figure 10.9 illustrates approximation by sums of the basic cosine block im-
ages (N = 32) described in that section. In the sum, terms have been ordered
so that the “tail” contains the high-frequency information. Roughly speaking,
each successive term in the sum adds a little more detail. Stopping the sum
at a certain point amounts to truncating subsequent (high) frequencies from the
original block, and is equivalent to replacing the appropriate entries in the trans-
formed matrix with zeros. Discarding these trailing zeros and retaining only the
nonzero coefficients corresponds to a compression method, and can even be
considered a special case of JPEG.³²

30 Of course, in the extreme case of reducing block size down to 1 × 1, the image will not change
nor will there be any compression.
31 In a sense, wavelet techniques attempt this.

Figure 10.11: A schematic of the JPEG process.

JPEG exploits the idea of local approximation for its compression: 8 x 8 
portions of the complete image are transformed using the cosine transform and 
then the information retained in each block is quantized by a method which 
tends to suppress higher-frequency elements. Figure 10.11 is a schematic of 
JPEG and JPEG-like compression schemes. Below is a quick summary of the 
ideas behind JPEG. 


1. Work locally. Carve the image into smaller k × k blocks. In the case of
JPEG, an m × n image is split up into 8 pixel by 8 pixel blocks, i.e., k = 8.
These blocks are usually very small pieces of the entire image, e.g., in
a 256 × 256 pixel image, an 8 × 8 block occupies only $100(8^2/256^2) \approx
0.098\%$ of the picture area.


2. Transform. Each block is transformed to expose spatial frequencies (de- 
tail) within. JPEG uses the cosine transform and expresses the original 
image in terms of 64 basic “cosine” images of fixed horizontal and verti- 
cal spatial frequencies; see Figure A.1. 


3. Quantize. A “rounding” procedure is performed which reduces magni-
tudes of the transformed coefficients. Typically more aggressive reduc-
tion is performed on coefficients corresponding to high-frequency com-
ponents. The coefficients which quantize to zero correspond to frequen-
cies omitted during block reconstruction. This is the “lossy” step of
JPEG.

32 The amount of compression is complicated by the fact that the entries in the original and trans-
formed matrices are not in the same range, but the main idea is correct. Also, to be precise, the
approximation is a special case of JPEG only if the image is 8 × 8 and the entries in the quantizer
can be chosen sufficiently large.


4. Encode. The output of step 3 is compressed with a lossless scheme. Huff- 
man and arithmetic coding are specified in the JPEG standard. 
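
Step 1 of the summary above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the image is a plain list of rows, its dimensions are multiples of k, and the helper name `carve_blocks` and the gradient test image are hypothetical, not part of any JPEG library.

```python
def carve_blocks(image, k=8):
    """Split a 2D list of pixel values into k-by-k blocks, in row-major order.

    Assumes, for simplicity, that both image dimensions are multiples of k.
    """
    m, n = len(image), len(image[0])
    blocks = []
    for top in range(0, m, k):
        for left in range(0, n, k):
            blocks.append([row[left:left + k] for row in image[top:top + k]])
    return blocks


# Hypothetical 256 x 256 test image (a simple gradient, not a real photograph).
image = [[(i + j) % 256 for j in range(256)] for i in range(256)]
blocks = carve_blocks(image)

# Percentage of the picture area covered by one 8 x 8 block.
block_area_percent = 100 * 8**2 / 256**2

print(len(blocks), round(block_area_percent, 3))
```

Each block in the returned list can then be transformed, quantized, and encoded independently, which is what steps 2 through 4 do.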


We’ve already said something about the benefits of working on small blocks
within an image. But there is an inherent weakness in any local approach that 
does not take into account the rest of the image: if small blocks are processed 
one at a time, removing information in a way that ignores the rest of the im- 
age, then we shouldn’t be surprised to find discontinuities between neighboring 
blocks after the image has been, block by block, reassembled. The question is 
whether or not they are noticeable—they certainly can be. For example, in Fig- 
ure 10.10 these blocking artifacts can be seen just about everywhere (of course, 
the simplistic block-averaging scheme used there wiped clean all detail from 
every block). 

In searching for a tool that would selectively allow more detail to remain 
in a block the JPEG group found the cosine transform to have some desirable 
properties. It’s relatively easy to compute, depending only on the dimension 
of the block to be transformed, and is computed in the same way throughout 
the image.³³ The coefficients of a cosine transformed array are also arranged
in a “natural” order from the low to high frequencies. However, it is perhaps 
the cosine transform’s “smoothing” effect, as much as anything, that helps us to 
see that the JPEG group made a good choice. Each of the Fourier transforms, 
including the cosine transform, views an image (signal) as defined everywhere 
on Z”, but the cosine transform does not generally introduce sharp transients 
the others may, cf., Section 10.4.1. This property allows for the design of reli- 
able quantizers and their stable implementation: if the cosine transform sees a 
block as containing high frequencies, then the high frequencies are likely to be 
genuine, that is, they probably haven’t been artificially introduced by the trans- 
form process itself. Typical images are largely continuous and locally smooth 
so this “extended vision” of the cosine transform often does a decent job at
predicting or guessing a few pixels into a block’s immediate surrounding.³⁴ Even
so, the very act of quantizing transformed block by transformed block results in 
abrupt changes to these blocks, producing discontinuities that ultimately will be 
passed back onto the image. In practice, artifacts between blocks from a JPEG 
processed image are not very noticeable unless aggressive quantizing has been 
used. In this case further smoothing may be desirable.³⁵

In the JPEG procedure, information is lost at the quantizing stage; the other
steps are invertible.³⁶ After the transformed coefficients of a block have been
quantized, the original block cannot generally be recovered. This trade-off
allows JPEG to obtain typical compression ratios of 20:1 or better with little
noticeable image degradation. Compare this to ratios of, say, 2 or 3 to 1 for the
lossless GIF or PNG methods.

33 The KLH transform depends on the data being transformed and requires
on-the-fly adjustment from one block to the next. [2]
34 The “smoothing” effect of the cosine transform is discussed and compared with
the Fourier and sine transforms in Section 10.3.
35 A smoothing strategy is discussed in Appendix A.
36 In applications, there are also roundoff errors when transforming, but these
are usually minor compared with the information loss from quantizing.

Generically, the word quantize refers to the process of slicing up or par- 
titioning continuous objects (intervals of real numbers for us) into sub-pieces 
(subintervals) and matching each sub-piece with some member of a discrete 
set. We do this all the time when working with numbers. Here are some ex- 
amples: rounding to the nearest integer, flooring, truncating a real number after 
its third decimal place, or replacing a positive real number with the integer part 
of its logarithm. In each of these cases, either the real numbers or some subin- 
terval of real numbers has been replaced by a discrete set together with a map 
containing instructions on which member of the discrete set we should assign to 
a given real number. If we were to sketch the graph of the map associated with 
any of the above examples we would see a series of “steps,” i.e., a step-function. 
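
The four quantizers just listed can be written down directly. The sketch below is illustrative only: the function names are hypothetical, rounding sends halves upward, and the logarithm is taken base 10 with the floor as its "integer part", since the text leaves those choices open.

```python
import math


def round_nearest(t):
    """Round to the nearest integer (halves go up in this sketch)."""
    return math.floor(t + 0.5)


def truncate3(t):
    """Drop everything after the third decimal place."""
    return math.trunc(t * 1000) / 1000


def int_log10(t):
    """Integer part (floor) of the base-10 logarithm of a positive number."""
    return math.floor(math.log10(t))


# Sampling a quantizer on a fine grid exposes the "steps" of its graph:
# many inputs collapse onto a handful of output levels.
steps = sorted(set(round_nearest(i / 10) for i in range(30)))

print(round_nearest(2.718), math.floor(2.718), truncate3(3.14159), int_log10(2500))
print(steps)
```

In every case a whole subinterval of inputs is sent to one representative value, which is why the original input cannot be recovered from the output.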

JPEG’s scheme quantizes individual ranges of each coefficient in the cosine
transform with high frequencies more aggressively quantized than low
frequencies. The scheme can be described as follows: each 8 x 8 transformed
block Tx is associated with an 8 x 8 array q of positive integers, an array of
“quantizers” referred to as a quantizing matrix. In the simplest case, the matrix
q is fixed for each block in the image. Each entry in Tx is then divided by its
corresponding integer entry in q and the result rounded to the nearest integer.
Provided the quantizer entries are large enough, the effect of this process is,
quite frequently, a very sparse matrix.

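One quantizer that is frequently used with JPEG is the luminance matrix

\[
q = \begin{bmatrix}
16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\
12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\
14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\
14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\
18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\
24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\
49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\
72 & 92 & 95 & 98 & 112 & 100 & 103 & 99
\end{bmatrix}.
\]
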


Each entry in this array is based on a visual threshold of its corresponding basis
element, see Figure A.1 and [57]. The smaller entries of q are generally found
in its upper left-hand corner and the larger entries in the lower right. In any
(cosine) transformed block Tx, the “low frequency” coefficients are located
towards its upper left-hand corner and the “high frequency” coefficients towards
its lower right-hand corner. Thus, the effect of quantizing Tx with q is to
suppress the higher frequency signals in the original block x. The design of the
luminance table q is typical of other “JPEG quantizers.”
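
As a small illustration of the divide-and-round step, the sketch below quantizes one row of coefficients with the first row of the luminance matrix and then dequantizes it. The coefficient row is the first row of the transformed ‘Lena’ block worked through later in this section; the function names are hypothetical, and rounding here sends halves upward, which may differ from other rounding conventions exactly at half-integers.

```python
import math

# First row of the luminance quantizing matrix.
q_row = [16, 11, 10, 16, 24, 40, 51, 61]

# First row of a transformed block Tx (taken from the 'Lena' example in this section).
t_row = [864.0, -17.0, -3.8, -3.4, 0.5, -1.1, 0.7, 1.2]


def quantize(coeffs, quantizers):
    """Divide entry by entry and round to the nearest integer."""
    return [math.floor(t / d + 0.5) for t, d in zip(coeffs, quantizers)]


def dequantize(levels, quantizers):
    """Multiply back by the quantizers; the rounding loss is not recovered."""
    return [level * d for level, d in zip(levels, quantizers)]


levels = quantize(t_row, q_row)
restored = dequantize(levels, q_row)

print(levels)    # quantized row
print(restored)  # what the decoder can rebuild from it
```

Only the two lowest-frequency entries survive; every coefficient whose magnitude is less than half its quantizer entry becomes zero.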

After the quantizing step is finished, the entries in the output array are
ordered, from low to high frequency, trailing zeros are truncated, and the
resulting string encoded. Figure 10.12 indicates the low to high frequency
ordering of a transformed (and transformed-quantized) block. In this ordering, an
entry corresponds to the amplitude of a frequency that is at least as high as the
frequency of its predecessors.

Figure 10.12: Ordering of the block entries.

Quantizing considerably reduces the number of distinct values that quan- 
tized coefficients can assume. Repetition of values among quantized coeffi- 
cients is likely to be found. In practice, zero is the most commonly repeated 
value and it’s usually the case that all but a small handful of the 64 coeffi- 
cients in the block get quantized to zero. In such cases the block will have a 
string representation consisting of a few nonzero quantized coefficients delimit- 
ing “long” strings of zeros and, since high frequencies are targeted aggressively, 
trailed by zeros to the end of the block. There is no need to encode the trail- 
ing zeros, only to mark where they begin in the block. The typical result is a 
transformed-quantized block with a string representation containing far fewer 
than 64 coefficients. Moreover, a JPEG encoder, designed to exploit the form 
of this string, is waiting to compress it even further. 

The process (omitting the encoder) on a given matrix x follows the diagram 


\[
x \xrightarrow{\text{transform}} Tx \xrightarrow{\text{quantize}} QTx
\xrightarrow{\text{dequantize}} T\tilde{x} \xrightarrow{\text{invert}} \tilde{x}
\]


where T is the cosine transform, defined by Tx = A'x A with A, the 8x8 cosine 
transform matrix, given (to a few places of accuracy) by: 


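\[
\begin{bmatrix}
.35 & .35 & .35 & .35 & .35 & .35 & .35 & .35 \\
.49 & .42 & .28 & .10 & -.10 & -.28 & -.42 & -.49 \\
.46 & .19 & -.19 & -.46 & -.46 & -.19 & .19 & .46 \\
.42 & -.10 & -.49 & -.28 & .28 & .49 & .10 & -.42 \\
.35 & -.35 & -.35 & .35 & .35 & -.35 & -.35 & .35 \\
.28 & -.49 & .10 & .42 & -.42 & -.10 & .49 & -.28 \\
.19 & -.46 & .46 & -.19 & -.19 & .46 & -.46 & .19 \\
.10 & -.28 & .42 & -.49 & .49 & -.42 & .28 & -.10
\end{bmatrix}.
\]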


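Let’s follow the process on a particular 8 x 8 matrix x, taken from part of the
smooth background in the ‘Lena’ image in Figure A.3. The background in
‘Lena’ has little shade variation (translation: little or no high-frequency
presence), so we expect x to compress well. x is at top left of the following
diagram:³⁷

\[
\begin{array}{ccc}
\begin{bmatrix}
102 & 104 & 105 & 110 & 111 & 110 & 118 & 115 \\
104 & 106 & 107 & 103 & 107 & 106 & 109 & 108 \\
104 & 107 & 107 & 111 & 108 & 107 & 109 & 116 \\
104 & 104 & 110 & 107 & 113 & 113 & 112 & 111 \\
105 & 104 & 108 & 106 & 111 & 111 & 108 & 108 \\
104 & 107 & 108 & 109 & 109 & 107 & 107 & 108 \\
105 & 103 & 108 & 109 & 108 & 108 & 111 & 111 \\
104 & 107 & 106 & 109 & 107 & 112 & 105 & 111
\end{bmatrix}
& &
\begin{bmatrix}
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112 \\
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112 \\
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112 \\
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112 \\
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112 \\
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112 \\
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112 \\
104 & 105 & 106 & 107 & 109 & 110 & 111 & 112
\end{bmatrix}
\\[1ex]
\Big\downarrow T & & \Big\uparrow (QT)^{-1}
\\[1ex]
\begin{bmatrix}
864.0 & -17.0 & -3.8 & -3.4 & 0.5 & -1.1 & 0.7 & 1.2 \\
1.9 & -5.2 & 2.4 & -0.8 & -0.6 & 0.9 & -3.8 & 3.0 \\
-0.9 & -2.6 & 2.7 & -1.4 & 0.3 & 1.5 & -2.5 & -3.0 \\
-0.5 & -0.9 & -1.5 & 1.6 & -0.8 & 2.4 & -2.5 & 2.9 \\
3.8 & -4.5 & -2.6 & 4.1 & -1.2 & -0.6 & 1.6 & -0.1 \\
5.9 & -6.1 & -0.6 & -1.5 & 1.4 & 3.5 & -1.3 & 1.1 \\
2.5 & -0.3 & -0.3 & -3.3 & 2.6 & -1.3 & -1.9 & -4.5 \\
1.0 & -1.9 & 1.3 & -1.4 & 2.6 & 1.3 & -0.2 & -1.4
\end{bmatrix}
& \xrightarrow{\;Q\;} &
\begin{bmatrix}
54 & -2 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\end{array}
\]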


The encoder receives its information from the sparse matrix QTx. The
entries in the block QTx are ordered (cf., Figure 10.12) and the tokens: 54, -2,
EOB are sent on to the encoder for further compression (EOB denotes an end of
block marker). The information contained in this string permits the array QTx
to be reconstructed exactly. At the receiving end, this is the only information
there is about the original x. From this information, the decoder can produce
the block $\tilde{x}$ at the top right of the diagram. We remark that because
8 x 8 blocks make up such a relatively small piece of most images, the sparseness
of QTx may be typical of a large fraction of the total number of blocks in an
image.
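
The whole chain (transform, quantize, dequantize, invert) for the block x above can be reproduced with a short pure-Python sketch. It is an illustration under stated assumptions rather than an excerpt from any JPEG implementation: the matrix `C` below holds the sampled cosines as its rows (the standard orthonormal cosine transform, matching the two-decimal matrix displayed earlier), the transform is computed as C x Cᵀ, rounding sends halves upward, and all helper names are hypothetical.

```python
import math

N = 8
# Rows of C are the orthonormal cosine basis vectors (DCT-II).
C = [[(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N))
      for n in range(N)]
     for k in range(N)]

Q_LUMINANCE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

# The smooth-background block x from the 'Lena' example.
x = [
    [102, 104, 105, 110, 111, 110, 118, 115],
    [104, 106, 107, 103, 107, 106, 109, 108],
    [104, 107, 107, 111, 108, 107, 109, 116],
    [104, 104, 110, 107, 113, 113, 112, 111],
    [105, 104, 108, 106, 111, 111, 108, 108],
    [104, 107, 108, 109, 109, 107, 107, 108],
    [105, 103, 108, 109, 108, 108, 111, 111],
    [104, 107, 106, 109, 107, 112, 105, 111],
]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nearest(t):
    return math.floor(t + 0.5)


Ct = transpose(C)
Tx = matmul(matmul(C, x), Ct)                      # transform
QTx = [[nearest(Tx[i][j] / Q_LUMINANCE[i][j])      # quantize
        for j in range(N)] for i in range(N)]
dequantized = [[QTx[i][j] * Q_LUMINANCE[i][j]      # dequantize
                for j in range(N)] for i in range(N)]
x_tilde = [[nearest(v) for v in row]               # invert and round
           for row in matmul(matmul(Ct, dequantized), C)]

print(QTx[0])
print(x_tilde[0])
```

Running it gives a quantized block whose only nonzero entries are the first two in the top row, and a reconstruction whose rows all equal the top row of the approximation in the diagram.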

It’s interesting to note that approximation $\tilde{x}$ has only shade variation
in the horizontal direction and not the vertical; this observation is easy to spot
from the transformed-quantized matrix QTx and the cosine basis image shown in
the first row and second column of Figure A.1. The entry-by-entry difference
between x and its JPEG replacement $\tilde{x}$ is

\[
x - \tilde{x} = \begin{bmatrix}
-2 & -1 & -1 & 3 & 2 & 0 & 7 & 3 \\
0 & 1 & 1 & -4 & -2 & -4 & -2 & -4 \\
0 & 2 & 1 & 4 & -1 & -3 & -2 & 4 \\
0 & -1 & 4 & 0 & 4 & 3 & 1 & -1 \\
1 & -1 & 2 & -1 & 2 & 1 & -3 & -4 \\
0 & 2 & 2 & 2 & 0 & -3 & -4 & -4 \\
1 & -2 & 2 & 2 & -1 & -2 & 0 & -1 \\
0 & 2 & 0 & 2 & -2 & 2 & -6 & -1
\end{bmatrix}.
\]


The eye, of course, is the best device to measure this error. 


37 Note that $(QT)^{-1}$ is used here to indicate dequantizing followed by the
inverse transform; however, the quantizing step $Tx \mapsto QTx$ is not invertible,
so this is a slight abuse of notation. Also, the result of the inverse transform
has been rounded.
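Here is another example. The 8 x 8 block z in the top left of the following
diagram is taken from the same ‘Lena’ image but from a region where detail is
prevalent:

\[
\begin{array}{ccc}
\begin{bmatrix}
150 & 151 & 155 & 169 & 164 & 149 & 156 & 171 \\
156 & 158 & 161 & 162 & 156 & 157 & 172 & 174 \\
161 & 145 & 150 & 160 & 164 & 175 & 168 & 155 \\
140 & 133 & 154 & 163 & 163 & 154 & 159 & 164 \\
140 & 151 & 163 & 156 & 145 & 156 & 168 & 172 \\
158 & 154 & 142 & 141 & 155 & 165 & 165 & 146 \\
154 & 142 & 142 & 154 & 161 & 152 & 149 & 157 \\
147 & 145 & 151 & 156 & 144 & 141 & 158 & 168
\end{bmatrix}
& &
\begin{bmatrix}
146 & 158 & 170 & 168 & 158 & 153 & 159 & 168 \\
161 & 153 & 148 & 156 & 169 & 173 & 164 & 153 \\
158 & 150 & 145 & 153 & 167 & 173 & 165 & 155 \\
139 & 151 & 162 & 161 & 154 & 152 & 162 & 173 \\
139 & 150 & 160 & 158 & 151 & 150 & 161 & 173 \\
157 & 147 & 139 & 145 & 159 & 167 & 162 & 154 \\
159 & 148 & 139 & 143 & 156 & 164 & 159 & 151 \\
144 & 153 & 159 & 154 & 144 & 142 & 154 & 166
\end{bmatrix}
\\[1ex]
\Big\downarrow T & & \Big\uparrow (QT)^{-1}
\\[1ex]
\begin{bmatrix}
1245.9 & -37.3 & 0.5 & -6.4 & 10.6 & 10.4 & -1.1 & 1.8 \\
26.4 & -3.7 & -7.7 & -5.0 & 2.6 & 1.4 & 0.0 & -0.5 \\
-0.8 & 9.3 & 6.7 & -9.4 & 13.9 & -3.5 & -5.7 & -1.9 \\
-1.8 & 5.6 & 4.1 & 1.4 & -5.5 & -8.0 & -4.5 & -0.3 \\
6.9 & 8.2 & -3.4 & -31.7 & 6.4 & 1.3 & 0.7 & 0.5 \\
-10.6 & 1.0 & -17.7 & 12.9 & 18.4 & -3.6 & 2.7 & 2.0 \\
-0.9 & 4.1 & -0.2 & 19.3 & -6.7 & 1.8 & -0.9 & -1.7 \\
4.8 & 0.2 & -0.2 & 1.8 & -0.6 & 0.3 & 3.7 & -4.1
\end{bmatrix}
& \xrightarrow{\;Q\;} &
\begin{bmatrix}
78 & -3 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\end{array}
\]

QTz is still sparse but a much longer string

\[
78,\ -3,\ 2,\ 0,\ 0,\ 0,\ 0,\ -1,\ 1,\ \underbrace{0, \ldots, 0}_{23\ \text{zeros}},\ -1,\ \text{EOB}
\]
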

is sent to the encoder. However, the JPEG standard requires the encoder to 
run-length encode the two stretches of zeros, so, in the end, the block will still 
compress well. Detail is more important in z than in x and, correspondingly, 
the JPEG process keeps more of it. 
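
The ordering, trailing-zero truncation, and zero-run idea can be sketched as follows. The diagonal scan below is the usual zigzag pattern, which reproduces the token string quoted above for QTz; the (run, value) pairing is a deliberately simplified stand-in for the encoder's run-length step, not the exact symbol format of the JPEG standard (which, among other things, treats the first coefficient separately and caps run lengths). All function names are hypothetical.

```python
def zigzag_order(n=8):
    """Index pairs (row, col) in low-to-high frequency (zigzag) order."""
    order = []
    for d in range(2 * n - 1):
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        # Odd diagonals run top-right to bottom-left, even ones the other way.
        order.extend(cells if d % 2 == 1 else cells[::-1])
    return order


def block_tokens(block):
    """Zigzag-scan a quantized block, drop trailing zeros, append EOB."""
    seq = [block[i][j] for i, j in zigzag_order(len(block))]
    while seq and seq[-1] == 0:
        seq.pop()
    return seq + ["EOB"]


def run_length_pairs(tokens):
    """Replace each stretch of zeros by a count attached to the next nonzero value."""
    pairs, run = [], 0
    for t in tokens[:-1]:          # the last token is the EOB marker
        if t == 0:
            run += 1
        else:
            pairs.append((run, t))
            run = 0
    return pairs + ["EOB"]


# The transformed-quantized block QTz from the detailed 'Lena' region.
QTz = [
    [78, -3, 0, 0, 0, 0, 0, 0],
    [2, 0, -1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

tokens = block_tokens(QTz)
pairs = run_length_pairs(tokens)
print(len(tokens), pairs)
```

The 34 tokens (33 coefficients plus the end-of-block marker) shrink to six (run, value) pairs and the marker, which is the sense in which the block "will still compress well."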

Software from the Independent JPEG Group was used to compress ‘bird’ at 
several “quality” levels, and the results are displayed in Figure 10.13. The sizes 
are given in bits per pixel (bpp); i.e., the number of bits, on average, required to 
store each of the numbers in the matrix representation of the image. The sizes 
for the GIF and PNG versions are included for reference.³⁸


38 ‘bird’ is part of a proposed collection of standard images at the Waterloo
BragZone, and has been modified for this textbook.

Figure 10.13: GIF, PNG, and JPEG compression on ‘bird’. Panels: (a) original
test image, 4.9 bpp GIF, 4.3 bpp PNG; (e) .16 bpp JPEG; (f) difference between
(a) and (e).


Exercises 10.5


1. Define a quantizing matrix

\[
q = \begin{bmatrix}
3 & 5 & 7 & 9 \\
5 & 7 & 9 & 11 \\
7 & 9 & 11 & 13 \\
9 & 11 & 13 & 15
\end{bmatrix}.
\]

For each x below, compute the transformed matrix Tx and then the quantized
matrix QTx = round(Tx./q), where we have borrowed the following MATLAB
notation. If A is a matrix, then round(A) is the matrix obtained from A by
rounding each of its entries to the nearest integer. If B is a matrix of the same
dimensions as A, then A./B is the matrix obtained by dividing each entry of A by
the corresponding entry of B.

(a)
\[
x = \begin{bmatrix}
160 & 160 & 160 & 160 \\
160 & 160 & 160 & 160 \\
160 & 160 & 160 & 160 \\
160 & 160 & 160 & 160
\end{bmatrix}
\]

(b)
\[
x = \begin{bmatrix}
160 & 160 & 160 & 161 \\
160 & 160 & 161 & 162 \\
160 & 161 & 162 & 163 \\
161 & 162 & 163 & 164
\end{bmatrix}
\]

(c)
\[
x = \begin{bmatrix}
10 & 0 & 0 & 0 \\
19 & 0 & 0 & 0 \\
160 & 160 & 0 & 0 \\
160 & 160 & 160 & 0
\end{bmatrix}
\]
It has been said that JPEG doesn’t do well on cartoons. Why would
someone say this and what do they really mean? Is it a “true” statement?
Does this block and how it transforms shed any light on the
matter? The block in part (a) “transforms” well and you could easily
find many like it in a cartoon image. Dequantize QTx to get $T\tilde{x}$, and
compare with Tx. Now do an inverse cosine transform on $T\tilde{x}$. Does
$\tilde{x}$ “look” anything like the matrix x you started with?

(d)
\[
x = \begin{bmatrix}
54 & 70 & 182 & 81 \\
183 & 1 & 240 & 227 \\
33 & 106 & 61 & 167 \\
23 & 7 & 46 & 38
\end{bmatrix}
\]
From a compression viewpoint QTx doesn’t look very promising.
What do you think happened? Hint: MATLAB’s rand() command was
used to generate x. Is this result expected?

(e) Exchange the quantizing matrix q with a more aggressive quantizer
of your own design. Using it, repeat items (a)-(d). For example, you
could simply scale up q (multiply q by a number larger than 1), use a
piece (corner) of the JPEG’s luminance matrix, selectively grab thresholds
from the luminance quantizer by matching frequencies with the
4 x 4 basis elements, or even make up one of your own.
2. (Project) The 8 x 8 transform size in JPEG was chosen for several reasons,
including hardware considerations and the desire to take advantage of local 
behavior. Larger transform sizes may offer the possibility of better com- 
pression at a given “quality” level, especially in high-resolution images. 
Section A.1 contains a simple example of using other transform sizes on 
the ‘Lena’ image. 


Can JPEG benefit from a larger transform size (ignoring hardware costs)? 
This is a more difficult question than can be answered here, but is well 
suited for experiment. Choose one or two test images, and compress with 
a “typical” JPEG 8x8 scheme. Then attempt to match the image quality, 
but obtain superior compression, with larger transform sizes. You will have 
to find suitable quantizing matrices for the larger transforms. In addition, 
you will need to determine a way to measure compression. This could be 
a simple counting of trailing zeros, a compression with some off-the-shelf 
lossless compressor on the output of your scheme, or a modification of 
the JPEG entropy coder (the first two of these are easy but not ideal; the 
third could involve some time). Include some samples, and write a short 
summary on the experiments. 


3. (Project) One troublesome aspect of JPEG-like schemes is the appearance 
of blocking artifacts. Section A.1 discusses a smoothing procedure pro- 
posed in [57]. In brief, the scheme on a specific 8 x8 block looks at nearest- 
neighbor block averages in order to adjust some of the low-frequency AC 
coefficients (subject to a certain clamping). Implement such a scheme 
(or adapt the supplied MATLAB scripts). There are several areas for ex- 
perimentation: the number of coefficients considered for smoothing, the 
clamping condition, and the polynomial approximation itself. Attempt to 
do better than the example given in Figure A.5. 




10.6 A brief introduction to wavelets 


The issue at heart in this chapter is really one of signal representation. From a
compression point of view, we would like to represent a signal efficiently, that
is, as a linear combination of basis elements using as few as possible. Since our
signals are discrete and finite, the signal representation problem is naturally
modeled from a linear algebra approach with the problem’s mathematical
setting being a linear space (of “signals”) together with a suitable choice of
basis elements. The linear space needs to be large enough to include all signals
we expect to encounter. For us, this has amounted to selecting $\mathbb{R}^N$
(or $\mathbb{C}^N$) for some large enough integer N.³⁹ Thus, the real artwork in
the subject comes down to choosing basis vectors with which to describe the
signals of the space. As alluded to in the preface to this chapter, there are many
choices.

39 An N x M array (image) can be identified with a vector in $\mathbb{R}^{MN}$.

Features of a signal that we especially wish to examine can guide us in our 
quest for the “right” basis vectors. For example, our development of the Fourier 
transform basis vectors in Section 10.2 was, in a sense, a consequence of our 
search for basic “frequencies” with which we could resolve periodic signals. 
Before starting off on this search we needed to query just what it is we should 
regard as “pure” tones or signals of fundamental frequencies. Although there 
were details left to stumble over, once this question was answered, the hard 
work was really finished because it was at this point where the “linear
algebra” machine took control and eventually led us to the (discrete) Fourier
basis elements $W_k$, $k = 0, 1, \ldots, N-1$ described at the end of the section.
Interestingly enough, and not obvious in the process of their development, the
vectors $\{W_k\}$ ended up being orthogonal. We’ve since seen that much of the
mathematical convenience of the Fourier transform stems from the linear algebra of
orthogonal expansions. However, requiring the basis to be orthogonal was not 
something we imposed and did nothing to guide us to these special vectors: the 
matter at issue was the synthesis of arbitrary signals with linear combinations 
of a small, fundamental set of signals with “known” frequencies; orthogonality 
was a bonus. 

The Fourier transform is a very important tool, indispensable in the realm of 
signal analysis. When used as a compression device, though, we might
sometimes wish it had the additional capacity of being able to highlight local
frequency information; generally, it doesn’t. The coefficient of $W_k$ in the
Fourier expansion of a signal may yield information about the overall strength of
the frequency (vector) $W_k$ in the signal, but this information is global: even
if a coefficient is substantial, it doesn’t normally give us any clue as to the
time interval(s) over which the corresponding frequency is significant.

As an example, consider a signal that is flat for a time, then rises to oscillate
rapidly over a short period of time, and then again becomes flat.⁴⁰ Omit a
coefficient from the Fourier transform of such a signal and you may have trouble
reconstructing it well. In such a situation, it could be advantageous to have, at 
our disposal, basis elements that reflect this sort of localization property. Per- 
haps then we would need fewer of them to describe such a signal—certainly a 
desirable situation from a compression standpoint. 

The Fourier transform is a general signal analysis tool and as such it is not 
too difficult to find special cases where it may not be optimal to use. There have 
been attempts to adapt the Fourier transform to better handle local information. 
They stretch from JPEG’s approach at cutting signals into small pieces, pro- 
cessing them one at a time, to generating new basis elements from the Fourier 
elements by taking their product with smooth cutoff functions (“windowed” 
Fourier transforms, cf. [13]) to the study of wavelets. 

The interest in and use of wavelet transforms has grown appreciably in
the recent years since Ingrid Daubechies [12] demonstrated the existence of
continuous (and smoother) wavelets with compact support.⁴¹ They have found
homes as theoretical devices in mathematics and physics and as practical tools
applied to a myriad of areas including the analysis of surfaces, image editing
and querying, and, of course, image compression.

40 Scanning left to right along a horizontal line in the bird image could yield
such a signal.

Figure 10.14: The Haar scaling function $\varphi$ and mother wavelet $\psi$.

Our goal in this section is to introduce a very simple wavelet family, the 
Haar wavelet, and apply it to an image compression problem. We’ll use the 
‘bird’ image from the last section as our test image and keep the presentation in 
line with the “linear algebra of orthogonal expansions” theme used in the rest 
of this chapter. The Haar example can be presented, somewhat superficially, 
without the theoretical structure necessary to understand wavelets more fully; 
for this same reason it is also an incomplete introduction to the method. We use 
it here mainly because of its accessibility and also because it really does work 
as an image compression device. However, the Haar wavelet is not nearly the 
whole story on wavelets and we refer the interested reader to several excellent 
sources on theory and further applications of wavelets, in particular see
[13, 14, 45, 52, 54, 58, 68, 78, 79].

Perhaps the simplest example of wavelets are the Haar wavelets. Start with
the Haar scaling function

\[
\varphi(x) = \begin{cases}
1, & \text{if } 0 \le x < 1, \\
0, & \text{otherwise.}
\end{cases}
\tag{10.47}
\]

The mother Haar wavelet, $\psi$, is defined by

\[
\psi(x) = \begin{cases}
1, & \text{if } 0 \le x < 1/2, \\
-1, & \text{if } 1/2 \le x < 1, \\
0, & \text{otherwise.}
\end{cases}
\]

The adjective “mother” should become clearer presently. Figure 10.14 contains
the graphs of these two functions.
Note that both of these functions are “finitely” supported and orthogonal on
$\mathbb{R}$. In fact, their common support is the interval [0, 1] and they are
orthogonal there as well. The unit interval is the place where all activity occurs
and from now on we confine ourselves to it. The wavelet $\psi$ takes on the two
values 1 and -1 on [0, 1] so it seems natural to identify $\psi$ with the ordered
pair (1, -1). Over [0, 1], $\varphi$ is constant (= 1) and so could be regarded, by
itself, as a basis for the real numbers $\mathbb{R}$, or, if $\varphi$ is
identified with the vector (1, 1), the pair

\[
\{\varphi, \psi\} = \left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix},
\begin{bmatrix} 1 \\ -1 \end{bmatrix} \right\}
\]

is an orthogonal basis for $\mathbb{R}^2$.

41 The support of a function is defined as the closure of the set of points over
which it is nonzero.

Figure 10.15: Wavelets $\psi^1_0$ and $\psi^1_1$.

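Further subdivision of the unit interval [0, 1] allows us to generate an
orthogonal basis for $\mathbb{R}^4$ by defining two additional wavelets

\[
\psi^1_0(x) = \begin{cases}
1, & \text{if } 0 \le x < 1/4, \\
-1, & \text{if } 1/4 \le x < 1/2, \\
0, & \text{otherwise,}
\end{cases}
\qquad
\psi^1_1(x) = \begin{cases}
1, & \text{if } 1/2 \le x < 3/4, \\
-1, & \text{if } 3/4 \le x < 1, \\
0, & \text{otherwise.}
\end{cases}
\]

Graphs of these wavelets appear in Figure 10.15. Note the size of their
supports is half that of the mother wavelet $\psi$. In fact, these new wavelets
are just offsprings of $\psi$ in the sense that $\psi^1_0(x) = \psi(2x)$ and
$\psi^1_1(x) = \psi(2x - 1)$. The interval [0, 1] is divided into fourths by the
two wavelets $\psi^1_0$ and $\psi^1_1$ and we may think of $\varphi$, $\psi$,
$\psi^1_0$, and $\psi^1_1$ as the 4-tuples (1, 1, 1, 1), (1, 1, -1, -1),
(1, -1, 0, 0), and (0, 0, 1, -1), respectively. In this case the collection

\[
\{\varphi, \psi, \psi^1_0, \psi^1_1\} = \left\{
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix},
\begin{bmatrix} 1 \\ 1 \\ -1 \\ -1 \end{bmatrix},
\begin{bmatrix} 1 \\ -1 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ 1 \\ -1 \end{bmatrix}
\right\}
\]

is an orthogonal basis for $\mathbb{R}^4$.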
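
The identification of these four functions with 4-tuples, and their orthogonality, can be checked mechanically. The sketch below samples each function at the midpoint of each quarter of [0, 1]; the sampling rule and the helper names are choices made for this illustration only.

```python
def phi(x):
    """Haar scaling function."""
    return 1 if 0 <= x < 1 else 0


def psi(x):
    """Mother Haar wavelet."""
    if 0 <= x < 0.5:
        return 1
    if 0.5 <= x < 1:
        return -1
    return 0


def sample(f, n):
    """Values of f at the midpoints of n equal subintervals of [0, 1]."""
    return [f((i + 0.5) / n) for i in range(n)]


basis = [
    sample(phi, 4),
    sample(psi, 4),
    sample(lambda x: psi(2 * x), 4),       # first offspring, supported on [0, 1/2]
    sample(lambda x: psi(2 * x - 1), 4),   # second offspring, supported on [1/2, 1]
]

# Dot products of all distinct pairs; orthogonality means every one is zero.
dots = [sum(a * b for a, b in zip(basis[i], basis[j]))
        for i in range(4) for j in range(i + 1, 4)]

print(basis)
print(dots)
```

The sampled vectors agree with the four 4-tuples listed above, and all six pairwise dot products vanish.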
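We can continue to add wavelets in this fashion. For example, add four
more wavelets $\psi^2_0$, $\psi^2_1$, $\psi^2_2$, and $\psi^2_3$ to the set
$\{\varphi, \psi, \psi^1_0, \psi^1_1\}$ by dividing the unit interval into eighths
and defining, for k = 0, 1, 2, 3,

\[
\psi^2_k(x) = \begin{cases}
1, & \text{if } 2k/8 \le x < (2k+1)/8, \\
-1, & \text{if } (2k+1)/8 \le x < (2k+2)/8, \\
0, & \text{otherwise.}
\end{cases}
\]

Observe that $\psi^2_k(x) = \psi(4x - k)$ for k = 0, 1, 2, 3. From inspection we
can see that the set $\{\varphi, \psi, \psi^1_0, \psi^1_1, \psi^2_0, \psi^2_1,
\psi^2_2, \psi^2_3\}$ forms an orthogonal basis for $\mathbb{R}^8$ when each is
identified with an 8-vector in the above way:

\[
\{\varphi, \psi, \psi^1_0, \psi^1_1, \psi^2_0, \psi^2_1, \psi^2_2, \psi^2_3\}
= \left\{
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix},
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ -1 \\ -1 \\ -1 \\ -1 \end{bmatrix},
\begin{bmatrix} 1 \\ 1 \\ -1 \\ -1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix},
\begin{bmatrix} 1 \\ -1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ 1 \\ -1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ -1 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ -1 \end{bmatrix}
\right\}.
\]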

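The supports of these new wavelets $\psi^2_k$ are half that of those just one
“resolution” level lower, that is, half that of the $\psi^1_k$. If this set is
used as a basis for $\mathbb{R}^8$ then coefficients of the $\psi^2_k$ yield
information about local detail in a signal.
Normalizing these eight vectors produces an orthonormal basis for $\mathbb{R}^8$

\[
\{\varphi/\sqrt{8},\ \psi/\sqrt{8},\ \psi^1_0/2,\ \psi^1_1/2,\ \psi^2_0/\sqrt{2},\
\psi^2_1/\sqrt{2},\ \psi^2_2/\sqrt{2},\ \psi^2_3/\sqrt{2}\}.
\]

The corresponding (wavelet) transform matrix (cf. Section 10.3.1) has the form

\[
H_3 = \begin{bmatrix}
1/\sqrt{8} & 1/\sqrt{8} & 1/2 & 0 & 1/\sqrt{2} & 0 & 0 & 0 \\
1/\sqrt{8} & 1/\sqrt{8} & 1/2 & 0 & -1/\sqrt{2} & 0 & 0 & 0 \\
1/\sqrt{8} & 1/\sqrt{8} & -1/2 & 0 & 0 & 1/\sqrt{2} & 0 & 0 \\
1/\sqrt{8} & 1/\sqrt{8} & -1/2 & 0 & 0 & -1/\sqrt{2} & 0 & 0 \\
1/\sqrt{8} & -1/\sqrt{8} & 0 & 1/2 & 0 & 0 & 1/\sqrt{2} & 0 \\
1/\sqrt{8} & -1/\sqrt{8} & 0 & 1/2 & 0 & 0 & -1/\sqrt{2} & 0 \\
1/\sqrt{8} & -1/\sqrt{8} & 0 & -1/2 & 0 & 0 & 0 & 1/\sqrt{2} \\
1/\sqrt{8} & -1/\sqrt{8} & 0 & -1/2 & 0 & 0 & 0 & -1/\sqrt{2}
\end{bmatrix}.
\]
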
In general, Haar wavelets of arbitrarily fine resolution can be generated
from the mother wavelet $\psi$ through dyadic shifts and scales of its argument.
More precisely, for a non-negative integer k (resolution level) we can define a
wavelet

\[
\psi^k_j(x) = \psi(2^k x - j), \quad \text{for } j = 0, 1, \ldots, 2^k - 1.
\]

With this notation, $\psi^0_0 = \psi$ and the $2^{k+1}$ vectors

\[
\{\varphi\} \cup \{\psi^\nu_j \mid \nu = 0, 1, \ldots, k;\ j = 0, 1, \ldots, 2^\nu - 1\}
\]

form an orthogonal basis for $\mathbb{R}^{2^{k+1}}$. Identifying them with (column)
vectors in $\mathbb{R}^{2^{k+1}}$ then normalizing and using them as columns in a
matrix $H_{k+1}$ gives a wavelet transform (on $\mathbb{R}^{2^{k+1}}$)

\[
\hat{v} = H_{k+1}^T v, \tag{10.48}
\]
\[
v = H_{k+1} \hat{v}. \tag{10.49}
\]

Figure 10.16: The 4x4 Haar basis elements.
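
The construction of the transform matrix and the pair (10.48)-(10.49) can be sketched in pure Python. The function names are hypothetical, the test signal is made up for this illustration (flat, one short oscillation, flat again), and the columns are ordered as in the text: the scaling vector first, then wavelets by increasing resolution level.

```python
import math


def haar_matrix(k):
    """Matrix of size 2^(k+1) whose columns are the normalized Haar vectors up to level k."""
    n = 2 ** (k + 1)
    cols = [[1.0] * n]                       # the scaling vector
    for level in range(k + 1):
        width = n // 2 ** level              # support length, in samples
        for j in range(2 ** level):
            v = [0.0] * n
            start = j * width
            for i in range(width // 2):
                v[start + i] = 1.0
                v[start + width // 2 + i] = -1.0
            cols.append(v)
    cols = [[c / math.sqrt(sum(t * t for t in v)) for c in v] for v in cols]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def forward(H, v):
    """Coefficients of v in the Haar basis: the transpose of H applied to v."""
    n = len(v)
    return [sum(H[i][j] * v[i] for i in range(n)) for j in range(n)]


def inverse(H, coeffs):
    """Rebuild the signal: H applied to the coefficient vector."""
    n = len(coeffs)
    return [sum(H[i][j] * coeffs[j] for j in range(n)) for i in range(n)]


H3 = haar_matrix(2)
signal = [5, 5, 5, 5, 9, 1, 5, 5]            # flat, brief oscillation, flat
coeffs = forward(H3, signal)
rebuilt = inverse(H3, coeffs)
significant = [j for j, c in enumerate(coeffs) if abs(c) > 1e-9]

print([round(c, 3) for c in coeffs])
print(significant)
```

For this signal only two coefficients are significant: the scaling coefficient and the single fine-level wavelet sitting over the oscillation, which is the localization property the section is after.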




10.6.1 2D Haar wavelets 


Because the Haar wavelets are orthogonal, the machinery for producing a 2D
Haar wavelet transform from the 1D transform is already in place (cf.,
Section 10.4). If f is an array (an image) of size $2^{k+1} \times 2^{k+1}$ and
$H_{k+1}$ is the 1D Haar wavelet transform matrix of (10.48)-(10.49), then the 2D
Haar transform $\hat{f}$ of f is given by

\[
\hat{f} = H_{k+1}^T f H_{k+1}.
\]

The original image f is recovered from its transform $\hat{f}$ by

\[
f = H_{k+1} \hat{f} H_{k+1}^T.
\]

The basis arrays for the 2D Haar wavelet transform are given by the
$2^{2(k+1)}$ matrices

\[
\varphi(j)\varphi(j'), \quad \varphi(j)\psi^{\nu'}_{\mu'}(j'), \quad
\psi^{\nu}_{\mu}(j)\varphi(j'), \quad \psi^{\nu}_{\mu}(j)\psi^{\nu'}_{\mu'}(j'),
\]

where $0 \le \nu, \nu' \le k$, $0 \le \mu \le 2^{\nu} - 1$,
$0 \le \mu' \le 2^{\nu'} - 1$, and $0 \le j, j' \le 2^{k+1} - 1$
are the row and column indices.

The 16 basis images at resolution level k = 1 are shown in Figure 10.16.
They form a 2D Haar basis for the set of 4x4 matrices. Compare these with the 
cosine transform elements in Figure 10.8. One can begin to see the formation 
of elements with localized supports even at this “coarse” resolution level. 


© 2003 by CRC Press LLC 




Image compression with wavelets 


When an image is expanded with Haar wavelets, the coefficient of the “scaling”
array $\varphi(j)\varphi(j')$, $0 \le j, j' \le 2^{k+1} - 1$, is just a scaled
average of all values within the image. Coefficients of the other 2D wavelet
arrays in the expansion are usually called detail coefficients.⁴² The simple
(lossy) compression scheme that we’ll describe here is not as elaborate as the
quantizing scheme used in JPEG. Basically, we throw away any detail coefficient
meeting a very simple criterion:


(1) start with an image f and a tolerance, say $\epsilon > 0$,


(2) wavelet transform f to $\hat{f}$ and replace with zero any coefficient in
$\hat{f}$ whose magnitude is less than $\epsilon$.


More elaborate criteria can also be used.⁴³ In Figure 10.17 we have used
this simple scheme on ‘bird’, at several tolerance settings. Compare with Fig- 
ure 10.13, where JPEG has been used at different settings to compress this same 
image. Setting a coefficient to zero in the transformed image is equivalent to 
eliminating the corresponding basis array in the expansion of the image—it’s 
another way of saying that that particular basis element is not thought impor- 
tant enough to keep in the expansion of the overall image. 

Unlike JPEG, wavelets have been presented as a method that transforms 
the entire image at once, not a “block” at a time. Pedagogically this makes for a 
clean description of the process even though it may not always be the best way 
to think about it. Also, this approach can involve fairly large matrices. However, 
wavelet matrices are generally quite sparse and not too taxing for machines of 
even modest performance. Even so, for applications where speed is important, 
e.g., motion, fast wavelet transform algorithms exist, cf. [54]. 

Figure 10.17 illustrates a certain kind of simple-minded partial sum (pro- 
jection) approach to compression. Examples of more sophisticated wavelet 
schemes appear in Figures 10.18 and 10.19. These were generated using Geoff 
Davis’ Wavelet Image Compression Construction Kit (see Appendix C). Davis 
cautions, “The coder is not the most sophisticated—it’s a simple transform 
coder—but each individual piece of the transform coder has been chosen for 
high performance.” Figure 10.18 uses a Haar wavelet scheme, and Figure 10.19 
uses a wavelet family from [3], which is the default in the Kit and different from 
the Haar wavelets. 


Exercises 10.6 


1. Apply the Haar wavelet transform to each of the matrices in Exercise 10.5.1. 
Choose some threshold of your own and discard the appropriate coeffi- 
cients. Now invert and compare your results with the JPEG results, in 
particular, on the array in part (c). 


42 This terminology is analogous to the DC and AC coefficients of the cosine
transform mentioned in Appendix A.
43 $L^1$ and $L^2$ schemes are discussed in [15] and [68]. Coefficient quantizing
schemes are also used.

Figure 10.17: ‘bird’ (256x256) using Haar wavelet transform with simple
thresholding. In (b)-(f), the percentage indicates the number of nonzero
coefficients in the transformed array after a threshold condition has been
applied. Panels: (a) original, 97% nonzero; (e) 4% nonzero; (f) 1.6% nonzero.

Figure 10.18: ‘bird’ using Haar wavelet from Davis’ Construction Kit.
Panels: (c) .34 bpp; (d) .16 bpp.


2. Find constants $c_n$ so that the Haar scaling function $\varphi$ in (10.47)
satisfies the scaling equation

\[
\varphi(x) = \sum_{n \in \mathbb{Z}} c_n \varphi(2x - n).
\]

Solutions to this equation are known as scaling functions and used in the
construction of wavelet families. Hint: Think geometrically, i.e., look at
Figure 10.14.

Figure 10.19: ‘bird’ using Davis’ Construction Kit with a wavelet from [3].
Panels: (c) .34 bpp; (d) .16 bpp.


10.7 Notes 


On color The discussion of JPEG and wavelets has centered on greyscale im- 
ages. Color images may identify a red, green, and blue triple (R, G, B) for each 
of the pixels, although other choices are possible. Color specified in terms of 
brightness, hue, and saturation, known as luminance-chrominance representa- 
tions, may be desirable from a compression viewpoint, since the human visual 
system is more sensitive to errors in the luminance component than in chromi- 
nance [57]. Given a color representation, JPEG and wavelet schemes can be 
applied to each of the three planes. 
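
As a concrete illustration of one luminance-chrominance representation, the following Octave/Matlab sketch converts a single (R, G, B) triple to a luminance value Y and two chrominance values Cb and Cr, and back again. The conversion constants are those commonly associated with JPEG files (the JFIF convention); they are an assumption here and are not taken from this text, and the pixel value is made up.

```octave
% One hypothetical pixel as an (R, G, B) triple with components in [0, 255].
rgb = [200; 120; 40];

% JFIF-style luminance-chrominance conversion (an assumption, not from the text).
M = [ 0.299      0.587      0.114
     -0.168736  -0.331264   0.5
      0.5       -0.418688  -0.081312];
offset = [0; 128; 128];

ycc  = M * rgb + offset;        % [Y; Cb; Cr]
back = M \ (ycc - offset);      % invert the linear map to recover (R, G, B)

disp(ycc');                     % luminance first, then the two chrominance values
disp(max(abs(back - rgb)));     % at the level of rounding error
```

Each chrominance row of M sums to zero, so a gray pixel (R = G = B) has Cb = Cr = 128 and all of its information sits in the luminance plane.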

On JPEG, fractals, and wavelets JPEG enjoys an open, freely usable standard
(other than the arithmetic coding option). This is a significant advantage in 
comparison with fractal and wavelet methods. To displace JPEG will require a 
scheme with real gains in speed, compression, and/or quality. 

Fractal methods were advertised as one such scheme, and very good com- 
pression has been achieved on some test images. There are other interesting 
features (such as a certain resolution independence), although the scheme can 
be somewhat intractable in practice, and the early claims for fractal methods 
were probably exaggerated. Yuval Fisher [20] offers a more cautionary assess- 
ment, noting two deficiencies: the encoding is computationally expensive and 
the encoded image gets very large as perfect reconstruction is approached. 

Wavelet and fractal methods are often said to be superior to JPEG at low bit 
rates, although this generalization needs to be qualified. Tom Lane, organizer 
of the Independent JPEG Group (IJG), writes: 


...the limitations of JPEG the standard ought not be confused with the limi- 
tations of a particular implementation of JPEG. There are hardly any JPEG 
codecs available that are optimized for very low quality settings. Certainly 
the IJG code is not (though I hope to do something about that in the next 
release). You may see a lot of blockiness in the current IJG encoder’s 
output at low [quality] settings, but you should not conclude that JPEG is 
incapable of doing better than that.⁴⁴


The problem, in part, is that the quantization is generally done by simple scaling 
of the suggested JPEG quantizing matrices. In an earlier post, Lane remarked: 


...the usual technique involves scaling the sample tables mentioned in the 
JPEG standard up or down by some given ratio. This works [reason- 
ably well] for scale ratios around 0.5 to 1.0 (that’s Q 50 to Q 75 in the 
IJG software, for example) but loses badly at much higher or lower
settings. The spec’s sample tables are only samples anyway--much more is
known about quantization-table design now than was true when the spec
was drafted. Not a lot of that research has propagated into shipping
products, though.⁴⁵


In addition, compression improvements could be obtained on the encoding side 
without breaking existing decoders: 


...Just because the decoder will reconstruct DCT coefficients [with simple 
multiplication] doesn’t mean the encoder must form the encoded values 
by simple division. Adobe’s (formerly Storm’s) encoder uses this idea a 
little bit, but it could be taken much further. In particular, you can do “poor 
man’s variable quantization” this way, without breaking compatibility with 
existing decoders, just by zeroing out coefficients that shouldn’t be zero 
according to a strict encoder. 


44 comp.compression newsgroup post, 30 Apr 1997.
45 comp.compression newsgroup post, 2 Sep 1996. Quoted by permission.


© 2003 by CRC Press LLC 




This kind of adaptive quantization is discussed in [57]. The problem of
blocking artifacts could be addressed on the decoding side, possibly along the
lines outlined in [57] and Appendix A.
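
The encoder-side idea in the last quotation can be sketched in a few lines of Octave/Matlab. The decoder always multiplies by the quantizer, but the encoder is free to send a zero where strict division and rounding would send ±1. The flat quantizer, the coefficient values, and the 0.75 threshold below are all made up for illustration; none of them comes from the JPEG standard or from any particular encoder.

```octave
Q  = 16 * ones(8);                   % hypothetical flat quantizer (not a JPEG table)
Tx = zeros(8);                       % made-up transform coefficients for one block
Tx(1,1) = 400;  Tx(1,2) = -37;  Tx(2,1) = 22;  Tx(3,3) = 9;  Tx(8,8) = 10;

strict = round(Tx ./ Q);             % "strict" encoder: simple division and rounding

% Extra encoder-side rule (an assumption for illustration): also zero any AC
% coefficient whose magnitude is below 0.75*Q, even if it would round to +-1.
loose = strict;
drop  = abs(Tx) < 0.75 * Q;
drop(1,1) = false;                   % never drop the DC coefficient
loose(drop) = 0;

% The decoder is the same in both cases: multiply by Q.
Ty_strict = strict .* Q;
Ty_loose  = loose  .* Q;

fprintf('nonzero coefficients: strict %d, loose %d\n', nnz(strict), nnz(loose));
```

Both outputs are valid input for an unmodified decoder; the second simply has fewer nonzero terms to code, at the cost of a slightly larger error in the dropped positions.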

One can imagine having three open standards, and, perhaps with some hu- 
man intervention, choosing the best scheme for a given image and given “qual- 
ity” criteria. However, the climate has changed somewhat since the JPEG stan- 
dard was developed, now that companies have discovered that the US Patent 
Office will (in essence) grant patents on algorithms. 

JPEGtool User’s Guide


This appendix describes the “JPEGtool” package of scripts used to study aspects
of JPEG (or JPEG-like) image compression. Scripts for Matlab¹ and Octave²
(and an optional Maple³ script) are provided which perform, for example, an
N × N discrete cosine transform of a matrix (image) and quantization. Section
A.1 illustrates the use of the package, and Section A.2 provides a synopsis.

At the simplest level, a standard JPEG transform and quantization scheme 
can be requested with a command of the form 


    jpeg('bird.pgm')

The result can easily be displayed on-screen with standard Octave or Matlab 
commands. More interesting use of the package includes display of partial 


sums (as matrices or images), experiments with the transform size N and the 
quantization matrices, and “smoothing” filters to reduce blocking artifacts. 


Requirements 


Matlab or Octave is required. Matlab is available for Macintosh computers, 
OS/2, Microsoft Windows, and many Unix-like platforms including Linux. Stu- 
dent editions for some platforms are also available, although these versions may 
place low ceilings on the size of matrices which can be manipulated. Section 
A.3 contains information on obtaining Octave. 

Maple was used to illustrate the calculation of coefficients in the smoothing 
program (which attempts to remove blocking artifacts), but it is not required. It 
should be routine to convert this for use with one of the other mathematical 
packages (MuPAD⁴ is a possibility).


Installation 


The scripts, along with various test images in the proper form, can be obtained 
by anonymous ftp from www.dms.auburn.edu in pub/compression. Users of 


1The Math Works, Inc. On-line information available through http://www.mathworks.com.

2Roughly speaking, Octave is a Matlab-like tool running on many Unix-like platforms and OS/2, 
and is freely-distributable under the GNU Public License (see Section A.3). 

3Maple is a general-purpose symbolic algebra system from Waterloo Maple Software. On-line 
information is available through http://www.maplesoft.com. 

4See http://www.mupad.de or MuPAD User’s Manual, John Wiley & Sons: 0-471-96716-5, 
1996. 
web browsers can retrieve these using the location
http://www.dms.auburn.edu/compression 


The files are “packaged” in various formats, in order to simplify installation. 
The files also appear individually in the jpegtool/src subdirectory. 

The files should be unpackaged if necessary and placed in a separate direc- 
tory or folder on your machine. Matlab or Octave must then be made aware of 
the location of these scripts (on some platforms, it suffices to set the location as 
the “working directory”). The precise methods for doing these tasks depend
on the platform, and will not be discussed in this appendix.

Some sample images are delivered with the scripts, including several which 
have been widely used as test images for various articles on compression. Other 
images can be used, but the scripts currently require that these be in portable 
graymap format. Many utilities can provide conversions between various graph- 
ics formats and graymaps. Under Unix-like platforms, Poskanzer’s Pbmplus 
toolkit⁵ and xv are commonly used.




A.1 Using the tools 


This section describes some ways that the image tools can be used with Octave 
or Matlab to study JPEG-like image compression. Strictly speaking, a graphical 
display is not required, although most users will want to experiment with actual 
images rather than just looking at the matrices. 

An “image” in this context is simply a matrix of integers ranging from 0 to 
255 (representing levels of gray). There are many ways to generate such images 
in Matlab or Octave, but typically the starting point is a “real” image or picture 
which has been saved in portable graymap format. 


Approximation by partial sums 


We begin with a simple example of the use of these scripts. As discussed in 
Chapter 10, the cosine transform exchanges spatial information for frequency 
information. If the transform is 8 × 8, then a given 8 × 8 portion of an image
can be written as a linear combination of the 64 basis matrices which appear 
in Figure A.1(a). The transform provides the coefficients in the linear combi- 
nation, allowing approximations or adjustments to the original image based on 
frequency content. Partial sum approximations are a special case. Often, the 
higher-frequency information in an image is of less importance to the eye, so 
if the terms are ordered roughly according to increasing frequency, then partial 
sum approximations may do well even with relatively few terms. 
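
The linear-combination statement above can be checked directly. The Octave/Matlab sketch below builds the orthonormal 8 × 8 cosine transform matrix from its formula rather than calling the JPEGtool scripts, forms each basis matrix as an outer product of two rows, and rebuilds a test block from all 64 terms. It then keeps only the 10 lowest-frequency terms as a partial sum. The test block magic(8) and the cutoff of 10 terms are arbitrary choices for illustration.

```octave
N = 8;
[k, n] = ndgrid(0:N-1, 0:N-1);               % k = frequency index, n = sample index
C = sqrt(2/N) * cos((2*n + 1) .* k * pi / (2*N));
C(1, :) = C(1, :) / sqrt(2);                 % scale the constant row so that C*C' = I

x = magic(N);                                % any 8x8 test "image"
T = C * x * C';                              % coefficients of x in the cosine basis

% The (v,u) basis matrix is the outer product of rows v+1 and u+1 of C.
basis = @(v, u) C(v+1, :)' * C(u+1, :);

s = zeros(N);                                % rebuild x from all 64 terms
for v = 0:N-1
  for u = 0:N-1
    s = s + T(v+1, u+1) * basis(v, u);
  end
end
disp(max(abs(s(:) - x(:))));                 % near zero: the full sum reproduces x

low = (k + n) <= 3;                          % the 10 terms with v + u <= 3
p   = C' * (T .* low) * C;                   % partial-sum approximation of x
disp(norm(x - p, 'fro'));                    % error left by the discarded terms
```

Because the basis is orthonormal, the error of a partial sum equals the size (Frobenius norm) of the discarded coefficients, which is why small high-frequency coefficients can be dropped at little cost.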


5 Extended Portable Bitmap Toolkit. Netpbm is based on the Pbmplus distribution of 10 Dec 91, 
and includes improvements and additions. 

Figure A.1: 8 × 8 basis elements, sample image, and the 64 partial sums.
Panels: (a) 8 × 8 basis elements; (b) sample image and the 64 partial sums.


Let’s take a specific 8 × 8 example. (Users of the student edition of Matlab
may need to use a 5 × 5 matrix.) We’ll use ‘>’ to denote the prompt printed by
Matlab or Octave, but this will vary by platform. Define the test image:

> x = round(rand(8)*255) % 8x8 random matrix, integer entries in [0,255] 


This will display some (random) matrix, perhaps 


    64   80   76   59  157  123   90  237
   252  109  214  220   83  194  181    3
   130  176   10   91  154  148  112   95
   153  124  149   26   29  199   60  228
    92  166  107  166  108  233  234  111
    91   32   10  190  248  231  160    4
    25  128  255   16  198  209  235    1
    89  217  195  107  213  119  103  183


and we can view this “image” with 

> imagesc (x) % Matlab users 

> imagesc(x, 8) % Octave users 
Something similar to the smaller image at the lower left in Figure A.1(b) will
be displayed.⁶ Now ask for the matrix of partial sums (the larger image in
Figure A.1(b)):

> imagesc(psumgrid(x)); % Display the 64 partial sums 
The partial sums are built up from the basis elements in the order shown in 
the zigzag sequence of Figure A.2. This path through A.1 is approximately 
according to increasing frequency of the basis elements. 
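
For readers who want to see the traversal as a list of indices, the following Octave/Matlab sketch generates a zigzag order by sorting the entries by anti-diagonal. It assumes the usual JPEG convention in which the path starts at the top left and takes its first step to the right; it is an independent sketch and not the package's own zigzag script.

```octave
N = 8;
[r, c] = ndgrid(1:N, 1:N);           % row and column index of every entry
d = r + c;                           % entries on the same anti-diagonal share d

% Within a diagonal, odd d is traversed with the row index increasing and
% even d with the row index decreasing; negating the key reverses the order.
key = r;
even = mod(d, 2) == 0;
key(even) = -key(even);

[~, order] = sortrows([d(:), key(:)]);
zz = [r(order), c(order)];           % N^2-by-2 list of (row, col) pairs

disp(zz(1:6, :));                    % first six steps of the path
```

The first rows of zz are (1,1), (1,2), (2,1), (3,1), (2,2), (1,3), and the last row is (8,8).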

Roughly speaking, the image in Figure A.1(b) is the worst kind as far as 

JPEG compression is concerned. Since it is random, it will likely have signifi- 
cant high-frequency terms. We can see these by transforming: 


6 Octave users may see a somewhat blurred image; if so, and if xv is your viewer, try the
‘r’ command while the image window is active.

Figure A.2: The zigzag sequence.


> Tx = dct(x)                  % discrete cosine transform of x
For the example above, this gives the matrix 


Tx =

   1062.8   -69.2   -68.1   117.2  -107.0   -33.3    22.5     5.2
    -44.7    26.7   117.0    -3.4    96.9    49.7    46.5   -82.4
     17.9    25.2   -39.1   -81.1    18.4   -54.4    -5.4   112.0
    -23.4  -115.5  -112.2    68.9     9.3    73.0   -25.7     8.5
     11.5   -65.7   146.3  -149.9    43.7  -126.2    58.5   -41.2
   -104.5   -82.1    61.4   -27.9   -36.9  -128.5    67.1    74.1
    -77.0   -71.7   -16.9    50.6   170.4  -115.3   -90.4   -54.3
      5.5   -42.3    -4.4    -5.7   -77.5   -40.3  -102.2    21.9

of coefficients used to build the partial sums in Figure A.1 from the basis
elements. The top left entry gets special recognition as the DC coefficient; the
others are the AC coefficients, AC_{0,1} through AC_{7,7}.

The terms in the lower right of Tx correspond to the high-frequency portion 
of the image. Notice that even in this “worst” case, Figure A.1 suggests that a 
fairly good image can be obtained with somewhat less than all of the 64 terms. 
A similar example with 4x4 basis elements appears in Figure 10.8 on page 278. 

It would be a good exercise at this stage to repeat the above steps with 
some nicer matrix. A constant matrix might be “too nice,” but something “less 
random” than the example may be appropriate. 
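
One candidate for a “less random” matrix is a horizontal ramp. The Octave/Matlab sketch below computes its transform directly from the cosine-transform matrix, without using the JPEGtool scripts; the ramp values are made up. Since every row of the ramp is the same, only the first row of the transform is nonzero.

```octave
N = 8;
[k, n] = ndgrid(0:N-1, 0:N-1);               % k = frequency index, n = sample index
C = sqrt(2/N) * cos((2*n + 1) .* k * pi / (2*N));
C(1, :) = C(1, :) / sqrt(2);                 % orthonormal cosine-transform matrix

x  = 100 + 10 * n;                           % ramp: 100, 110, ..., 170 in every row
Tx = C * x * C';                             % same role as dct(x) for one 8x8 block

disp(round(Tx * 10) / 10);                   % rounded to one decimal place
fprintf('DC = %.1f, sum(x(:))/N = %.1f\n', Tx(1,1), sum(x(:)) / N);
```

With this normalization the DC coefficient is the sum of the entries divided by N, that is, N times the average gray level of the block.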

The process of approximation by partial sums is applied to a “real” image in 
Figure 10.9 on page 278, where 1/4, 1/2, and 3/4 of the 1024 terms for a 32 × 32
image are displayed. Our approximations retain all of the frequency information 
corresponding to terms from the zigzag sequence below some selected threshold 
value; the remaining higher-frequency information is discarded. Although this 
can be considered a special case of a JPEG-like scheme, JPEG allows more 
sophisticated use of the frequency information. 

On to JPEG


The ‘Lena’ image of Figure A.3 has been widely used in publications on com- 
pression. However, it may be too large for some configurations; if so, the 32x32 
‘math4.pgm’ image can be substituted in the examples. Perform the transform
and quantization schemes of JPEG, and view the result: 


> setdefaults;                 % Set default quantizer, etc.
> y = jpeg('lena.pgm');        % Do JPEG transform and quantization scheme
> imagesc(y);                  % Display the resulting image

At this point, the image in Figure A.3(b) should be visible. This isn’t a very
exciting use of the tools, but it does illustrate the mechanics.⁷

Let’s examine the process more carefully. Recall that JPEG compression 
works by transforming an image so that the frequency information is directly 
available, and then quantizing in a way that tends to suppress some of the high- 
frequency information and also so that most of the terms can be represented in 
fewer bits. To “recover” the image, there is a dequantizing step followed by 
an inverse transform. (We’ve ignored the portion of JPEG which does lossless 
compression on the output of the quantizer, but this doesn’t affect the image 
quality.) 
In the above example, the default quantizer stdQ is used. If we do the 
individual steps which correspond to the above fragment, we might write 

> setdefaults;                 % Set the default quantizer, etc.
> x = getpgm('lena.pgm');      % Get a graymap image
> Tx = dct(x);                 % Do the 8x8 cosine transform
> QTx = quant(Tx);             % Quantize, using standard 8x8 quantizer
> Ty = dequant(QTx);           % Dequantize
> y = invdct(Ty);              % Recover the image
> imagesc(y);                  % Display the image


It should be emphasized that we cannot recover the image completely—there 
has been loss of information at the quantizing stage. It is illustrative to compare 
the matrices x and y. The difference image x — y for this kind of experiment 
appears in Figure 10.13 on page 289. There is considerable interest in measur- 
ing the “loss of image quality” using some function of these matrices. This is a 
difficult problem, given the complexity of the human visual system. 
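
To make the loss concrete without the package, the following Octave/Matlab sketch carries a single 8 × 8 block through transform, quantization, dequantization, and inverse transform, and then computes two simple functions of the difference x − y. The block values are made up, and the two measures (root-mean-square error and peak signal-to-noise ratio) are common choices assumed here for illustration; the text does not single out any particular measure.

```octave
N = 8;
[k, n] = ndgrid(0:N-1, 0:N-1);
C = sqrt(2/N) * cos((2*n + 1) .* k * pi / (2*N));
C(1, :) = C(1, :) / sqrt(2);             % orthonormal cosine-transform matrix

Q = [16 11 10 16  24  40  51  61         % JPEG 8x8 luminance matrix
     12 12 14 19  26  58  60  55
     14 13 16 24  40  57  69  56
     14 17 22 29  51  87  80  62
     18 22 37 56  68 109 103  77
     24 35 55 64  81 104 113  92
     49 64 78 87 103 121 120 101
     72 92 95 98 112 100 103  99];

x   = 16*n + 8*k + 20;                   % made-up smooth block, values in [20, 188]
Tx  = C * x * C';                        % transform
QTx = round(Tx ./ Q);                    % quantize (the lossy step)
Ty  = QTx .* Q;                          % dequantize
y   = round(C' * Ty * C);                % inverse transform, rounded to integers

err     = x - y;
rmse    = sqrt(mean(err(:) .^ 2));       % root-mean-square error
psnr_db = 20 * log10(255 / rmse);        % in dB (Inf if the block is reproduced exactly)

fprintf('%d of 64 quantized coefficients are nonzero\n', nnz(QTx));
fprintf('rmse = %.2f, psnr = %.1f dB\n', rmse, psnr_db);
```

Each dequantized coefficient differs from the original by at most half of the corresponding quantizer entry, which is the only place where information is lost; neither measure, however, captures how visible the error is to the eye.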

This still isn’t very exciting. In JPEG-like schemes, there are (at least) two 
obvious places to experiment: choice of quantizer, and size of the block used in 
the transform (which also affects the choice of quantizer). 


Adjusting the quantizer 


The choice of quantizer can, of course, greatly affect the results, both in terms 
of compression and quality. The default quantizer in these tools is the standard 


7 To be precise, a rounding procedure should be done on the matrix y. In addition, we have
ignored the zero-shift specified in the standard, which affects the quantized DC coefficients. 
JPEG 8 × 8 luminance matrix

stdQ =

   16   11   10   16   24   40   51   61
   12   12   14   19   26   58   60   55
   14   13   16   24   40   57   69   56
   14   17   22   29   51   87   80   62
   18   22   37   56   68  109  103   77
   24   35   55   64   81  104  113   92
   49   64   78   87  103  121  120  101
   72   92   95   98  112  100  103   99



We can see the effects of more (or less) quantizing by scaling this matrix (or 
choosing an entirely different matrix). In addition, we can ask jpeg or quant to 
report the amount of compression. For example, using the standard quantizer, 


> x = getpgm('lena.pgm');      % Store the original image in x
> [y, r] = jpeg(x);            % jpeg, result in y and ratio in r
> r                            % Print the compression ratio
r = 73.0

where the last line represents the compression achieved at the quantizing stage.⁸
Now we’ll use a more aggressive quantizer (namely, stdQ*2): 

> [z, r] = jpeg(x, stdQ*2);    % jpeg, more aggressive quantizer
> r                            % Print the compression ratio
re 


The compression is better, but at the cost of some degradation in the image 
quality. The images are in x, y, and z, respectively, and can be displayed with 
the image commands shown earlier. It may also be illustrative to examine the 
errors x — y and x—z. 
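
The effect of scaling the quantizer can be seen on a single block of coefficients. The Octave/Matlab sketch below uses a made-up coefficient matrix whose entries decay with frequency (it is not the transform of any real image) and counts how many quantized coefficients remain nonzero as stdQ is scaled. The count of nonzero terms is a cruder indicator than the ratio reported by the package, which is based on trailing zeros in the zigzag sequence.

```octave
Q = [16 11 10 16  24  40  51  61         % JPEG 8x8 luminance matrix (stdQ)
     12 12 14 19  26  58  60  55
     14 13 16 24  40  57  69  56
     14 17 22 29  51  87  80  62
     18 22 37 56  68 109 103  77
     24 35 55 64  81 104 113  92
     49 64 78 87 103 121 120 101
     72 92 95 98 112 100 103  99];

[k, n] = ndgrid(0:7, 0:7);
Tx = 1000 ./ (1 + k + n) .^ 2;           % hypothetical coefficients, decaying with frequency

scales = [0.5 1 2 4];
counts = zeros(size(scales));
for i = 1:numel(scales)
  QTx = round(Tx ./ (scales(i) * Q));    % quantize with the scaled matrix
  counts(i) = nnz(QTx);
  fprintf('stdQ*%.1f: %2d of 64 coefficients nonzero\n', scales(i), counts(i));
end
```

A coefficient that quantizes to zero under stdQ also quantizes to zero under any larger multiple of stdQ, so the count can only stay the same or drop as the scale factor grows.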

Many of the tools can be called with different numbers of arguments, with 
different types of arguments in a given position, and with different return val- 
ues. This is typical of routines in Octave or Matlab, and can be very convenient. 
It is important to understand how the tools choose the quantizer and blocksize 
if these are not explicitly specified in the call. For example, if quant is called 
without a second argument (i.e., there is no quantizer specified), then a global 
quantizer QMAT is used (where QMAT defaults to stdQ). The quantizer used by 
quant becomes the new value for QMAT, and is used by default in other func- 
tions such as dequant. Hence, the following sequence of calls would give the 
expected results: 

> q = stdQ*2; Tx = dct(x, length(q));
> QTx = quant(Tx, q); dequant(QTx);

This “works” because the call to quant will set q as the new value for the global
quantizer QMAT. Since dequant is called without specifying a quantizer as the 
second argument, the routine will use QMAT as the quantizer. (In this example, 
it would be preferable to set QMAT directly, and then omit passing the blocksize 
to dct and the quantizer to quant.) 

Of course, it would also be “legal” to replace the second line with the se- 
quence 


8See the reference section A.2 for the interpretation of the compression percent. 

> QTx = quant(Tx, stdQ*2); dequant(QTx, stdQ);   % wrong


This may lead to interesting results, but it is probably not what was intended, 
since quant and dequant are using different quantizers. Similarly, the fragment 


> Tx = dct(x, 16); quant(Tx, stdQ);              % wrong


is “legal,” but probably undesirable since the transform is on 16 × 16 blocks but
stdQ is 8 × 8. As a final example, note that a call such as jpeg(x, stdQ*2) will
set the global quantizer. If this call is followed by jpeg(x), then the result will
be the same as the first call.


Adjusting the blocksize 


The second obvious area for experimentation is the blocksize used in the trans- 
form. There are a number of reasons given for the 8 x 8 blocksize, includ- 
ing hardware constraints. One heuristic consideration mentioned is that larger 
blocks are more likely to include portions of the image which are very different, 
in contrast to the “roughly constant” blocks on which JPEG does best. This is, 
of course, dependent on the resolution, but it is perhaps reasonable by today’s 
standards. 

On the other hand, better compression may be achieved on some images 
if a larger blocksize is chosen. Changing the blocksize, however, also changes 
the quantizer. As an experiment, we examine the 256 × 256 ‘Lena’ image under
the standard JPEG 8 × 8 scheme and with modified schemes using transforms of
sizes 16 × 16 and 32 × 32. For the 8 × 8 quantizer, the suggested JPEG luminance
quantizer was used. For the other quantizers, matrices were chosen with the 
typical properties of a quantizer; e.g., entries increase from the top left to the 
bottom right. Matrices with this rough property can be obtained from the Hilbert 
matrix, available in Matlab or Octave with the ‘hilb’ function. The fragment 
for the 16 x 16 experiment appears below: 


> x = getpgm('lena.pgm');      % Get the Lena image
> q = 6 ./ hilb(16) + 26       % Possible 16x16 quantizer
> y = jpeg(x, q);              % Do 16x16 transform
> imagesc(y);                  % Matlab users
> imagesc(y, 2);               % Octave users


Figure A.3 shows the results. An analysis of the compression and the image 
quality needs to be done before making any definitive statements. In addition, 
there are some serious questions about our choice of quantizers for the larger 
blocksizes. However, it is perhaps safe to say that the 32 x 32 transform is 
too large for this particular example. It’s not hard to see why there is serious 
degradation: a 32 x 32 block covers a relatively large portion of the image, and 
much of the “local” property on which JPEG relies has been lost. 
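
The quantizer used in the 16 × 16 experiment is easy to inspect. The (i, j) entry of hilb(16) is 1/(i + j − 1), so the (i, j) entry of 6 ./ hilb(16) + 26 is 6(i + j − 1) + 26. The Octave/Matlab sketch below confirms that the entries grow from the top left to the bottom right; it uses only the built-in hilb function.

```octave
q = 6 ./ hilb(16) + 26;                  % the 16x16 quantizer from the fragment above

disp(round(q(1:4, 1:4)));                % top left corner of the matrix
fprintf('top left %g, bottom right %g\n', round(q(1,1)), round(q(16,16)));

rows_increase = all(all(diff(q, 1, 2) > 0));   % entries grow along each row
cols_increase = all(all(diff(q, 1, 1) > 0));   % entries grow down each column
disp([rows_increase, cols_increase]);
```

The entries range from 32 at the top left to 212 at the bottom right, a somewhat wider spread than the 10 to 121 range of the 8 × 8 luminance matrix.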


A JPEG enhancement 


As a final example of the use of these scripts, we consider an enhancement to 
JPEG described in Pennebaker and Mitchell [57]. One troublesome aspect of 

Figure A.3: ‘Lena’ with various transform sizes.
Panels: (c) 16 × 16 JPEG; (d) 32 × 32 JPEG.


JPEG-like schemes is the appearance of “blocking artifacts,” the telltale discon- 
tinuities between blocks which often follow aggressive quantizing. The image 
on the left in Figure A.5 was produced using stdQ*4 as the quantizer. Clearly 
visible blocks can be seen, especially in the “smoother” areas of the image. 
Since the DC coefficients represent the (scaled) average value over the block,
it might be reasonable to use the nearest-neighbor coefficients to smooth
a given block by predicting the low-frequency AC coefficients. Any
low-frequency AC coefficients which are zero will be replaced by the predicted
values. However, the replacement values should be “clamped” so that (in
magnitude) they do not exceed one-half of the corresponding value in the
quantizer (values larger than this would not have quantized to zero).

   DC_1  DC_2  DC_3
   DC_4  DC_5  DC_6
   DC_7  DC_8  DC_9

Think of the original image X as a surface with height at (y, x) given by
X(y, x). For a given N × N block (the block corresponding to DC_5 in the grid),
the 3 × 3 superblock consisting of its nearest neighbors contains 3^2 N^2 total
entries. Fit a polynomial

\[ p(y, x) = a_1 x^2 y^2 + a_2 x^2 y + a_3 x y^2 + a_4 x^2 + a_5 x y + a_6 y^2 + a_7 x + a_8 y + a_9 \]

by requiring that the average value over the kth submatrix equal DC_k (this gives
nine equations for the unknowns a_1, ..., a_9). The polynomial defines a surface
over the center block, which approximates the corresponding portion of the
original surface.⁹ Figure A.4 shows a surface X in (a) and the polynomial
approximations in (b).

The polynomial approximation is fed through the cosine transform, giving
AC “predictor” coefficients (in terms of the DC coefficients) for the
corresponding portion of the original surface. The first two such predictors
(which can be obtained from the ‘deblockc’ Maple script) are given by

\[ AC_{0,1} = a\,(DC_4 - DC_6) \quad\text{and}\quad AC_{1,0} = a\,(DC_2 - DC_8), \]

where

\[ a = \frac{\sqrt{2}}{128}\left(\cos\frac{7\pi}{16} + 3\cos\frac{5\pi}{16} + 5\cos\frac{3\pi}{16} + 7\cos\frac{\pi}{16}\right) \approx 0.14235657. \]


The decoder, which only has the quantized information from the original sur- 
face, uses these predictors to “guess” suitable values for the low-frequency AC 
coefficients (subject to the clamping described above). Figure A.4 illustrates the 
process, where the lowest five AC coefficients were considered for smoothing. 
The procedure applied to an aggressively quantized ‘bird’ appears in Figure A.5. 
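
The first two predictors and the clamping rule can be tried on a small example. The Octave/Matlab sketch below assumes 8 × 8 blocks, takes the constant a as 0.14235657 written out as a sum of cosines, and uses a made-up 3 × 3 grid of DC coefficients. The quantizer entries 11 and 12 are the stdQ values in the AC_{0,1} and AC_{1,0} positions. It is a sketch of the idea only, not the package's deblock routine, and it assumes both AC coefficients were quantized to zero.

```octave
% Predictor constant for 8x8 blocks (about 0.14235657).
a = sqrt(2)/128 * (cos(7*pi/16) + 3*cos(5*pi/16) + 5*cos(3*pi/16) + 7*cos(pi/16));

% Hypothetical DC coefficients of a 3x3 superblock (8 times the block averages):
%   DC_1 DC_2 DC_3 / DC_4 DC_5 DC_6 / DC_7 DC_8 DC_9
DC = 8 * [ 90  92  94
          100 102 104
          110 112 114];

Q01 = 11;  Q10 = 12;                     % stdQ entries for AC_{0,1} and AC_{1,0}

pred01 = a * (DC(2,1) - DC(2,3));        % a*(DC_4 - DC_6)
pred10 = a * (DC(1,2) - DC(3,2));        % a*(DC_2 - DC_8)

% Clamp so that the magnitude does not exceed half of the quantizer entry.
clamp = @(p, q) sign(p) .* min(abs(p), q / 2);
AC01 = clamp(pred01, Q01);
AC10 = clamp(pred10, Q10);

fprintf('a = %.8f\n', a);
fprintf('AC01: predicted %.3f, used %.3f\n', pred01, AC01);
fprintf('AC10: predicted %.3f, used %.3f\n', pred10, AC10);
```

Here the gentle horizontal gradient gives a prediction inside the allowed range, which is used as is, while the steeper vertical gradient gives a prediction that is cut back to −6, half of the quantizer entry.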


As an elementary example of this smoothing process, we can consider a
single 3 × 3 superblock. Since the smoothing process is done on the matrix
which results from

   x --transform--> Tx --quantize--> QTx --dequantize--> Ty,

we may as well define Ty directly. Let’s take Ty to be zero, except for the 9
DC coefficients (which we take to be 1-9, respectively):


> Ty = zeros(8*3);               % Initialize 24x24 superblock
> for k=1:9                      % 9 DC coefficients to define
>   r = floor((k-1)/3)*8+1;      % Determine proper (row,col) of DC_k
>   c = rem(k-1,3)*8+1;
>   Ty(r, c) = k;                % Define the DC_k entry
> end
> Ty                             % Display the matrix


Since all the AC coefficients are zero, the result of the inverse transform (on 
one of the 9 blocks) will be a matrix with k/8 in every entry, where k is the DC 
coefficient. Display this with: 


> imagesc(invdct (Ty,8)); % Display the result 


9 There is a scaling involved at this stage.

Figure A.4: The smoothing process.
Panels: (c) JPEG on original surface; (d) smoothed version.

Figure A.5: ‘bird’ with aggressive quantizing, then smoothed.

Figure A.6: Original and the “smoothed” superblock.


Finally, we can see the results of the smoothing process with 


> Tz = deblock(Ty, stdQ);        % Smooth
> imagesc(invdct(Tz, 8));        % Display the result

The two images appear in Figure A.6. 

This simple example shows what the smoothing process would do with 
such a superblock, but it is not very clear if this is a viable process. Looking 
back at Figure A.5, we might even ask why the process didn’t do better. As an 
experiment, the smoothing procedure can be directed to consider more (or less) 
than the five lowest-frequency AC coefficients. 

A better question might be “How much could be expected?” Recall that 
the procedure uses only the DC coefficients of nearest neighbors. This scheme 
is attractive, in part because of its simplicity and the fact that it can be used as 
a back-end procedure to JPEG (regardless of whether the original file was compressed
with this in mind). However, JPEG achieves its rather impressive compression 
by discarding information. The smoothing procedure sometimes makes good 
guesses about the missing data, but it cannot recover the original information. 




A.2 Reference 


basis
Purpose Calculate basis matrices.
Synopsis basis(v, u)
         basis(v, u, N)
Description basis finds the (v, u) basis matrix of size N × N. N defaults to 8.
See also basisgrid, psum, psumgrid

basisgrid
Purpose Create matrix of N × N basis elements.
Synopsis basisgrid
         basisgrid(N)
Description The N × N basis matrices in a cosine expansion are returned in a
matrix (with (N + 1)N rows, due to the space between submatrices). N
defaults to 8.
See also basis, psumgrid

dct
Purpose Perform discrete cosine transform on an image.
Synopsis dct
         dct(X)
         dct(X, N)
Description dct cuts an m × n image X into N × N subimages and transforms
these images. If X is not given, then X = ans. If N is not given, then N is
the size of the global quantizing matrix QMAT, or, if QMAT is undefined
or of size 0, then N = 8.
See also invdct

dctmat
Purpose Build the N × N matrix for the cosine transform.
Synopsis dctmat
         dctmat(N)
Description The N × N matrix for the cosine transform is returned. N defaults
to 8.
Example If X is N × N, then
   c = dctmat(N); c*X*c'
would give the cosine transform of X.

deblock
Purpose Smooth blocking artifacts.
Synopsis deblock(X)
         deblock(X, Q)
         deblock(X, k)
         deblock(X, Q, k)
Description deblock performs a smoothing procedure in an effort to reduce
blocking artifacts (the telltale discontinuities between blocks which often
follow aggressive quantizing). For a given block, a polynomial is
fitted using the DC coefficients from nearest neighbors. The k (default
5) lowest-frequency zero AC coefficients are replaced by values from the
polynomial. Replaced values are clamped so that (in magnitude) they do
not exceed values which would not have quantized to zero. If the
quantizer Q is not given, then the global QMAT is used.

The procedure is applied to the matrix Ty which (at least conceptually)
is the result of

   x --transform--> Tx --quantize--> QTx --dequantize--> Ty.

Bugs It’s slow.

deblockc [Maple]


Purpose Calculate AC coefficients in a smoothing scheme. 


Synopsis deblockc() 
deblockc(k) 
deblockc(k, N) 


Description deblockc is a Maple procedure to find the AC coefficients (in 
terms of the DC coefficients) which could be used as part of a smoothing 
procedure to reduce blocking artifacts (the telltale discontinuities between 
blocks which often follow aggressive quantizing) in JPEG. 


This procedure may be of interest since it symbolically solves for the co- 
efficients; however, no routines depend on deblockc. See the text for ex- 
amples and more information. The procedure is described in Pennebaker 
and Mitchell [57]. 


Example 
AC := deblockc(k, N) : 


will display k (default value is 5) of the terms in the zigzag sequence
{AC_{0,1}, AC_{1,0}, AC_{2,0}, ...} (first symbolically, then numerically) and the
results are stored in AC. N is the blocksize used by the transform (default
is 8).

dequant
Purpose Unapply quantizing matrix.
Synopsis dequant
         dequant(X)
         dequant(X, Q)
Description dequant cuts an m × n matrix X into N × N matrices and dequantizes
using the matrix Q. If X is not given, then X = ans. If Q is not
given, then Q = QMAT.

do_dct
Purpose Utility routine to do forward or inverse transform.
Synopsis do_dct(X, N, inv)
Description do_dct cuts an m × n image X into N × N subimages and transforms
these images. It is designed to be called from higher-level routines
such as dct and invdct. N is the size of the transform, and inv is a flag
indicating forward (FALSE) or inverse (TRUE).
Bugs It’s slow.
See also dct, invdct

getint
Purpose Retrieve an integer from a stream.
Synopsis getint(fid)
Description getint reads the next integer from fid. All characters from ‘#’ to
the end of a line are ignored. Intended to be used by other scripts, such
as getpgm.

getpgm
Purpose Read a graymap file.
Synopsis getpgm(filename)
         [x, maxgray] = getpgm(filename)
Description getpgm reads a pgm file (in either raw P5 or ascii P2 format)
and returns a matrix suitable for display with the image function. The
maxgray return value is the maximum gray value (see pgm(5)).
Bugs Under Octave-1.1.1, only raw P5 format can be used.

invdct
Purpose Perform inverse cosine transform on an image.
Synopsis invdct
         invdct(X)
         invdct(X, N)
Description invdct cuts an m × n image X into N × N subimages and performs
an inverse cosine transform. If X is not given, then X = ans. If N is not
given, then N is the size of the global quantizing matrix QMAT, or, if
QMAT is undefined or of size 0, then N = 8.
See also dct


jpeg 
Purpose Converts image via transform → quantize → invert.
Synopsis jpeg(X) 
jpeg(X, Q) 
[Y,r] = jpeg(X) 
[Y,r] = jpeg(X, Q) 

Description jpeg takes the original image X (which may be a matrix or a file- 
name) and uses the cosine transform and the specified quantizing matrix 
Q to generate a new image Y. The process is lossy at the quantizing 
stage, and Y will usually differ from X. 

If the quantizer is not given, then the global quantizer QMAT will be 
used. Initially, QMAT is the 8 x 8 JPEG luminance matrix. The quantizer 
becomes the new value for QMAT. 

If the ratio r is requested, then a calculation of the “lossy compression” 
is performed. This is returned as a percentage and measures the amount 
of savings obtained at the quantizing stage. 

Bugs The ratio measures only the savings obtained by removing “trailing ze- 
ros” in the submatrices. This often gives useful information about the 
choice of quantizer, but it is not the whole story of compression. 

psum 

Purpose Calculate partial sums. 

Synopsis psum(X, 7) 

Description The nth partial sum in the cosine series for X is returned. 

psumgrid 


Purpose Calculate partial sum grid. 
Synopsis psumgrid(X ) 
Description The partial sums for X are collected (in zigzag order) in a matrix. 


See also basisgrid, psum 

quant 


Purpose Apply quantizing matrix. 
Synopsis quant 
quant(X) 
quant(X, Q) 
[Y,r] = quant 
[Y,r] = quant(X) 
[Y,r] = quant(X, Q) 

Description quant cuts an m x n matrix X into N x N submatrices and quantizes 
using the N x N matrix Q. If X is not given, then X = ans. If Q is not 
given, then Q = QMAT, or, if QMAT is undefined or of size 0, then Q is 
chosen to be the 8 x 8 JPEG luminance matrix. Q becomes the new value 
for QMAT. 


If the ratio r is requested, then a calculation of the “lossy compression” 
is performed. This is returned as a percentage and measures the amount 
of savings obtained at the quantizing stage. 


Bugs See the Bugs under jpeg for the interpretation of the ratio. 
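
As a rough illustration of the quantizing step, the following C sketch assumes the usual JPEG rule (divide each entry by the matching entry of Q and round to the nearest integer) and assumes that m and n are multiples of N. The name quantize_blocks and its argument layout are hypothetical; quant itself is an Octave script and may differ in details such as the handling of QMAT and of partial blocks.

```c
/* Quantize the m x n matrix x (stored row by row) using the nq x nq
 * quantizing matrix q, writing rounded quotients to y.
 * Assumption: m and n are multiples of nq, so the blocks tile x exactly
 * and entry (i,j) is divided by q[i mod nq][j mod nq]. */
static void quantize_blocks(int m, int n, const double *x,
                            int nq, const double *q, long *y)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double v = x[i * n + j] / q[(i % nq) * nq + (j % nq)];
            /* Round to nearest, halves away from zero, without libm. */
            y[i * n + j] = (v >= 0) ? (long)(v + 0.5) : -(long)(-v + 0.5);
        }
    }
}
```

Large entries of Q send small high-frequency coefficients to zero, which is where the savings reported by the ratio r come from.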
setdefaults 


Purpose Set defaults for jpegtool session. 
Synopsis setdefaults 


Description Sets various global defaults, such as the quantizing matrix QMAT 
and the colormap. Should be run at the start of every session. 


trailnum 


Purpose Count trailing zeros in zigzag sequence. 
Synopsis trailnum(X) 


Description JPEG-like compression leads to matrices which usually contain 
zeros in the high-frequency entries. Counting the number of trailing zeros 
(in the zigzag pattern) gives an indication of the compression achieved at 
the quantizing stage; trailnum returns this count. 


See also jpeg, quant 
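
The count described above can be sketched in C for a single N x N block. This is only an illustration under stated assumptions: the names zigzag_order and trailing_zeros are hypothetical, the block is stored row by row, N is at most 16, and the traversal is the standard JPEG zigzag with 1-based indices. The trailnum script itself may treat whole images and edge cases differently.

```c
/* Fill rows[] and cols[] (each of length n*n) with the 1-based zigzag
 * traversal of an n x n matrix: anti-diagonals in order of row+col,
 * alternating direction. */
static void zigzag_order(int n, int *rows, int *cols)
{
    int k = 0;
    for (int s = 0; s <= 2 * (n - 1); s++) {      /* s = row + col, 0-based */
        int lo = (s < n) ? 0 : s - (n - 1);       /* smallest valid row */
        int hi = (s < n) ? s : n - 1;             /* largest valid row  */
        if (s % 2 == 0) {                         /* even diagonal: move up and right */
            for (int r = hi; r >= lo; r--) {
                rows[k] = r + 1; cols[k] = s - r + 1; k++;
            }
        } else {                                  /* odd diagonal: move down and left */
            for (int r = lo; r <= hi; r++) {
                rows[k] = r + 1; cols[k] = s - r + 1; k++;
            }
        }
    }
}

/* Count the zeros at the end of the zigzag sequence of the n x n block x.
 * Assumption: n <= 16. */
static int trailing_zeros(int n, const int *x)
{
    int rows[256], cols[256];
    int count = 0;
    zigzag_order(n, rows, cols);
    for (int k = n * n - 1; k >= 0; k--) {
        if (x[(rows[k] - 1) * n + (cols[k] - 1)] != 0)
            break;
        count++;
    }
    return count;
}
```

For the 3 x 3 block with rows (5,3,0), (2,0,0), (1,0,0), the zigzag sequence is 5, 3, 2, 1, 0, 0, 0, 0, 0, so the count is 5.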
zigzag 
Purpose Generate traversal-through-matrix used by jpeg. 
Synopsis zigzag 
zigzag(N) 

Description The skew-diagonal-traversal-pattern of jpeg for an N x N matrix 
is returned in an N^2 x 2 matrix. N defaults to 8. The N^2 x 2 matrix contains 
the appropriate row and column indices (starting at 1). 

A.3 Obtaining Octave 


Octave is a high-level language, primarily intended for numerical computations. 
It provides a convenient command line interface for solving linear and nonlinear 
problems numerically. Currently, Octave runs on Unix-like platforms, OS/2, 
and MS-Windows (via Cygwin).¹⁰ 

Octave is free software; you can redistribute it and/or modify it under the 
terms of the GNU General Public License as published by the Free Software 
Foundation (FSF).¹¹ You can get Octave from a friend who has a copy, by 
anonymous ftp, or by ordering a tape or CD-ROM from the FSF. 


Free Software Foundation        Voice:  +1-617-542-5942 
59 Temple Place - Suite 330     Fax:    +1-617-542-2652 
Boston, MA 02111-1307, USA      E-Mail: gnu@gnu.org 


Octave is developed by John W. Eaton, with contributions from many folks. 
Complete sources, documentation, and ready-to-run executables for several 
popular systems are available via www.octave.org. The GNU Octave Manual 
by John W. Eaton “is also now available and may be ordered from http://www. 
network-theory.co.uk/octave/manual/. Any money raised from the sale of this 
book will support the development of free software. For each copy sold, $1 will 
be donated to the GNU Octave Development Fund.” 


Support Programs 


Octave relies on an external program to view images. The default is John 
Bradley’s xv, but xloadimage or xli can also be used (OS/2 uses ghostview). 

xv has a generous license, and use of xv generally requires registration. 
Complete details are available with the source distribution. The latest version of 
xv (or at least a pointer to it) is available via anonymous ftp on ftp.cis.upenn.edu, 
in the directory pub/xv; the official site is now http://www.trilon.com. 

xloadimage was written by Jim Frost and may be obtained from ftp.x.org 
under R5contrib. The xli viewer was written by Graeme Gill and is based on 
xloadimage. It may be obtained from ftp.x.org in contrib/applications. 

Information on GSview, ghostview, and ghostscript may be obtained from 
http://www.cs.wisc.edu/~ghost, which is maintained by the author of GSview, 
Russell Lang. 


¹⁰ The authors have used Octave under GNU/Linux i486 and on Sun SPARCs running 
Solaris. The OS/2 port was done by Klaus Gebhardt and is available from http://hobbes.nmsu.edu 
in os2/apps/math/. An article by Isaac Leung on the OS/2 version appears in OS/2 eZine magazine, 
16 July 2002, www.os2ezine.com. 

¹¹ The FSF is a nonprofit organization that promotes the development and use of free software. 
(The word “free” refers to freedom, not price.) The GNU Project was launched in 1984 to develop 
a complete Unix-like operating system (the Hurd). Variants of the GNU operating system, which 
use the kernel Linux, are now widely used. GNU (guh-NEW) is a recursive acronym for “GNU’s 
Not Unix.” 

Source Listing for LZRW1-A 


This appendix contains the complete listing of the LZRW1-A dictionary scheme 
discussed in Chapter 9, along with some additional notes on the algorithm. The 
authors of this book are grateful to Dr. Ross N. Williams for permission to 
include the sources.¹ 

Williams describes LZRW1-A as “a direct descendant of LZRW1 [83]” 
with optimizations. These algorithms illustrate design decisions favoring speed 
and low resource requirements over compression. The expensive search through 
the history for a match has been almost completely eliminated by the use of a 
hash function, illustrated in Figure B.1. At each stage, a hash of the first three 
characters of the lookahead gives an index into the hash table. The current 
value in the hash table is used for attempting a match, the hash table is updated 
to point at the first character of the lookahead, and then the window is moved. 
In the case of a match, the window will be moved by at least 3 characters, and 
no additional updating of the hash table occurs until the next match attempt. 
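
The matching step just described can be sketched in C as follows. The hash expression is the one that appears in the compress routine listed later in this appendix; everything else (the names hash3 and try_match, the use of positions instead of pointers, and the convention that a table entry of 0 means "empty") is a simplifying assumption made for illustration and is not Williams' code.

```c
#include <stddef.h>

#define HASH_SIZE 4096          /* table entries 0..4095 */

/* Hash of the three bytes starting at p. */
static unsigned hash3(const unsigned char *p)
{
    unsigned long h = ((((unsigned long)p[0] << 4) ^ p[1]) << 4) ^ p[2];
    return (unsigned)(((40543UL * h) >> 4) & 0xFFF);
}

/* One matching attempt at position pos of text[0..len-1].
 * table[h] holds (position + 1) of the most recent attempt whose first
 * three bytes hashed to h, or 0 if there was none (hypothetical layout).
 * Returns the match length (3..18) and stores the distance back in
 * *offset, or returns 0 if the byte at pos must be sent as a literal. */
static size_t try_match(const unsigned char *text, size_t len, size_t pos,
                        size_t *table, size_t *offset)
{
    unsigned h;
    size_t cand, start, dist, n = 0;

    if (pos + 3 > len)
        return 0;                       /* fewer than 3 bytes of lookahead */
    h = hash3(text + pos);
    cand = table[h];
    table[h] = pos + 1;                 /* table now points at the lookahead */
    if (cand == 0)
        return 0;
    start = cand - 1;
    dist = pos - start;
    if (dist == 0 || dist > 4095)
        return 0;                       /* candidate lies outside the history */
    while (n < 18 && pos + n < len && text[start + n] == text[pos + n])
        n++;
    if (n < 3)
        return 0;                       /* hash collision or short match */
    *offset = dist;
    return n;
}
```

A caller would advance by one position after a literal and by the returned length after a match, calling try_match only at those positions; this is why positions inside a matched string never enter the table.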


[Figure B.1: Hashing in LZRW1-A. The hash function maps the start of the 18-byte 
lookahead to one of the hash table entries 0 to 4095, which points into the 
4095-byte history of the text “She sells sea shells by the seasho...”.] 


In short, at each stage the hash table contains, for each hash value, a pointer 
to the most recent occurrence of a 3-character sequence with that hash at which 
a match was attempted. From this, the dictionary can be identified 
as all the 3-18 character sequences which start at offsets from the hash table 
(and which start in the history), together with the characters from the symbol 
set. 

¹ Permission obtained via email 6 August 1996. Williams can be reached electronically through 
http://www.rocksoft.com/ross. 


As an example, suppose the fragment ‘␣sea’ is added to the end of the 
“She sells...” text, giving 

She sells sea shells by the seashore sea
         ^^^    ^^^^^    ^^^^^     |
        (6,3)   (11,5)   (24,5)

where the carets mark characters matched against the dictionary (each run is 
labeled with its (offset, length) pair), and the vertical line indicates the 
current position in the scan. At this stage, the “preferred” match is to 
‘e␣sea’ in the history. However, this fragment is not in the dictionary, 
because it begins inside the (24,5) copy and no hash table entry was recorded 
there, and the match will actually be to the letters ‘e␣se’ in the first 
two words of the example. 


Definitions and documentation 


The C sources for LZRW1-A are listed on the next few pages. Some changes 
for portability have been made; in particular, it was necessary to change the 
interface slightly. Williams wrote a family of LZRW algorithms,² and his basic 
framework has been retained in these sources. In addition, minor reformatting 
was done, and the note on patents was deleted since it was discovered that the 
algorithm may be covered by patent (see Appendix C). The first portion 
presented below consists of definitions and documentation. 

² LZRW1-3 are described briefly in [26]. Complete sources are available on Williams’ site (see 
Appendix C). 

/******************************************************************************/
/* */
/* LZRW1-A.C */
/* */
/******************************************************************************/
/* */
/* Author  : Ross Williams. */
/* Date    : 25 June 1991. */
/* Release : 1. */
/* */
/******************************************************************************/
/* */
/* This file contains an implementation of the LZRW1-A data compression */
/* algorithm in C. */
/* */
/* The algorithm is a general purpose compression algorithm that runs */
/* fast and gives reasonable compression. The algorithm is a member of */
/* the Lempel-Ziv family of algorithms and bases its compression on the */
/* presence in the data of repeated substrings. */
/* */
/* The algorithm/code is based on the LZRW1 algorithm/code. Changes are: */
/* 1) The copy length range is now 3..18 instead of 3..16 as in LZRW1. */
/* 2) The code for both the compressor and decompressor has been */
/*    optimized and made a little more portable. */
/* */
/* WARNING: This algorithm is non-deterministic. Its compression */
/* performance may vary slightly from run to run. */
/* */
/******************************************************************************/


                        /* INCLUDE FILES */
#include "port.h"       /* Defines symbols for the non portable stuff. */
#include "compress.h"   /* Defines single exported function "compress". */

/******************************************************************************/

/* The following structure is returned by the "compress" function below */
/* when the user asks the function to return identifying information. */
/* The most important field in the record is the working memory field */
/* which tells the calling program how much working memory should be */
/* passed to "compress" when it is called to perform a compression or */
/* decompression. For more information on this structure see compress.h. */

static struct compress_identity identity =
{
 0x4B3E387B,              /* Algorithm identification number. */
 sizeof(UBYTE **)*4096,   /* Working memory (bytes) to alg. */
 "LZRW1-A",               /* Name of algorithm. */
 "1.0",                   /* Version number of algorithm. */
 "22-Jun-1991",           /* Date of algorithm. */
 "Public Domain",         /* Copyright notice. */
 "Ross N. Williams",      /* Author of algorithm. */
 "Renaissance Software",  /* Affiliation of author. */
 "Public Domain"          /* Vendor of algorithm. */
};

LOCAL void compress_compress   (void *,UBYTE *,ULONG,UBYTE *,ULONG *);
LOCAL void compress_decompress (UBYTE *,ULONG,UBYTE *,ULONG *);


/******************************************************************************/

/* This function is the only function exported by this module. Depending */
/* on its first parameter, the function can be requested to compress a */
/* block of memory, decompress a block of memory, or to identify itself. */
/* For more information, see the specification file "compress.h". */

EXPORT void compress(action,wrk_mem,src_adr,src_len,dst_adr,p_dst_len)
UWORD  action;     /* Action to be performed. */
void  *wrk_mem;    /* Address of working memory we can use. */
UBYTE *src_adr;    /* Address of input data. */
ULONG  src_len;    /* Length of input data. */
UBYTE *dst_adr;    /* Address to put output data. */
ULONG *p_dst_len;  /* Address of longword for length of output data. */
{
 switch (action)
   {
    case COMPRESS_ACTION_IDENTITY:
       *(struct compress_identity **) wrk_mem = &identity;
       break;

    case COMPRESS_ACTION_COMPRESS:
       compress_compress(wrk_mem,src_adr,src_len,dst_adr,p_dst_len);
       break;

    case COMPRESS_ACTION_DECOMPRESS:
       compress_decompress(src_adr,src_len,dst_adr,p_dst_len);
       break;
   }
}

/******************************************************************************/
/* */
/* The remainder of this file contains some definitions and two more */
/* functions, one for compression and one for decompression. This section */
/* contains information and definitions common to both algorithms. */
/* Most of this information relates to the compression format which is */
/* common to both routines. */
/* */
/******************************************************************************/
/* */
/* DEFINITION OF COMPRESSED FILE FORMAT */
/* ==================================== */
/* * A compressed file consists of a COPY FLAG followed by a REMAINDER. */
/* * The copy flag CF uses up four bytes with the first byte being the */
/*   least significant. */
/* * If CF=1, then the compressed file represents the remainder of the */
/*   file exactly. Otherwise CF=0 and the remainder of the file consists */
/*   of zero or more GROUPS, each of which represents one or more bytes. */
/* * Each group consists of two bytes of CONTROL information followed by */
/*   sixteen ITEMs except for the last group which can contain from one */
/*   to sixteen items. */
/* * An item can be either a LITERAL item or a COPY item. */
/* * Each item corresponds to a bit in the control bytes. */
/* * The first control byte corresponds to the first 8 items in the */
/*   group with bit 0 corresponding to the first item in the group and */
/*   bit 7 to the eighth item in the group. */
/* * The second control byte corresponds to the second 8 items in the */
/*   group with bit 0 corresponding to the ninth item in the group and */
/*   bit 7 to the sixteenth item in the group. */
/* * A zero bit in a control word means that the corresponding item is a */
/*   literal item. A one bit corresponds to a copy item. */
/* * A literal item consists of a single byte which represents itself. */
/* * A copy item consists of two bytes that represent from 3 to 18 bytes. */
/* * The first byte in a copy item will be denoted C1. */
/* * The second byte in a copy item will be denoted C2. */
/* * Bits will be selected using square brackets. */
/*   For example: C1[0..3] is the low nibble of the first byte, C1, */
/*   of a copy item. */
/* * The LENGTH of a copy item is defined to be C1[0..3]+3 which is a */
/*   number in the range [3,18]. */
/* * The OFFSET of a copy item is defined to be C1[4..7]*256+C2[0..7] */
/*   which is a number in the range [1,4095] (the value 0 is never used). */
/* * A copy item represents the sequence of bytes */
/*   text[POS-OFFSET..POS-OFFSET+LENGTH-1] where "text" is the entire */
/*   text of the uncompressed string, and POS is the index in the text */
/*   of the character following the string represented by all the items */
/*   preceding the item being defined. */
/* */
/******************************************************************************/

/* The following define defines the length of the copy flag that appears */
/* at the start of the compressed file. I have decided on four bytes so */
/* as to make the source and destination longword aligned in the case */
/* where a copy operation must be performed. */
/* The actual flag data appears in the first byte. The rest are zero. */
#define FLAG_BYTES 4     /* How many bytes does the flag use up? */

/* The following defines define the meaning of the values of the copy */
/* flag at the start of the compressed file. */
#define FLAG_COMPRESS 0  /* Signals that output was result of compression. */
#define FLAG_COPY     1  /* Signals that output was simply copied over. */

/******************************************************************************/
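
The copy item layout described in the format comments can be made concrete with a pair of small helper functions. These helpers are a sketch for illustration only: the names pack_copy_item and unpack_copy_item are hypothetical and do not appear in Williams' sources, which perform the same bit operations inline. The layout assumed is LENGTH = C1[0..3]+3 and OFFSET = C1[4..7]*256 + C2.

```c
/* Pack a copy item into two bytes.
 * Assumes 1 <= offset <= 4095 and 3 <= length <= 18. */
static void pack_copy_item(unsigned offset, unsigned length,
                           unsigned char item[2])
{
    /* C1: high nibble = top four bits of offset, low nibble = length-3. */
    item[0] = (unsigned char)(((offset & 0xF00) >> 4) | (length - 3));
    /* C2: low eight bits of offset. */
    item[1] = (unsigned char)(offset & 0xFF);
}

/* Recover offset and length from the two bytes of a copy item. */
static void unpack_copy_item(const unsigned char item[2],
                             unsigned *offset, unsigned *length)
{
    *length = (item[0] & 0x0Fu) + 3;
    *offset = ((unsigned)(item[0] & 0xF0u) << 4) | item[1];
}
```

For instance, the copy item with offset 24 and length 5 packs to the bytes 0x02 and 0x18. Because the length sits in the low nibble and the offset bits need only a single shift, the decoder does very little bit manipulation per item.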


The compress routine 


The main compression routine is listed next. The hash function can be seen in 
the assignment of p_entry. (See exercise 9.1.5 for information on this type of 
hash function.) The routine returns the original string (after the header bytes) in 
the case that expansion occurs (this is the overrun case). 


LOCAL void compress_compress(p_wrk_mem,p_src_first,src_len,
                             p_dst_first,p_dst_len)
/* Input  : Specify input block using p_src_first and src_len. */
/* Input  : Point p_dst_first to the start of the output zone (OZ). */
/* Input  : Point p_dst_len to a ULONG to receive the output length. */
/* Input  : Input block and output zone must not overlap. */
/* Output : Length of output block written to *p_dst_len. */
/* Output : Output block in Mem[p_dst_first..p_dst_first+*p_dst_len-1]. */
/* Output : May write in OZ=Mem[p_dst_first..p_dst_first+src_len+288-1]. */
/* Output : Upon completion guaranteed *p_dst_len<=src_len+FLAG_BYTES. */
UBYTE *p_src_first,*p_dst_first; ULONG src_len,*p_dst_len; void *p_wrk_mem;
#define PS *p++!=*p_src++   /* Body of inner unrolled matching loop. */
#define ITEMMAX 18          /* Max number of bytes in an expanded item. */
#define TOPWORD 0xFFFF0000
{
 register UBYTE *p_src=p_src_first,*p_dst=p_dst_first;
 UBYTE *p_src_post=p_src_first+src_len,*p_dst_post=p_dst_first+src_len;
 UBYTE *p_src_max1,*p_src_max16;
 register UBYTE **hash=p_wrk_mem;
 UBYTE *p_control; register ULONG control=TOPWORD;
 p_src_max1= (src_len>=ITEMMAX) ? p_src_post-ITEMMAX+1 : p_src;
 p_src_max16= (src_len>=16*ITEMMAX) ? p_src_post-16*ITEMMAX+1 : p_src;
 *p_dst=FLAG_COMPRESS; {UWORD i; for (i=1;i<FLAG_BYTES;i++) p_dst[i]=0;}
 p_dst+=FLAG_BYTES; p_control=p_dst; p_dst+=2;
 while (TRUE)
   {register UBYTE *p,**p_entry; register UWORD unroll=16;
    register ULONG offset;
    if (p_dst>p_dst_post) goto overrun;
    if (p_src>=p_src_max16)
      {unroll=1;
       if (p_src>=p_src_max1)
         {if (p_src==p_src_post) break; goto literal;}}
    begin_unrolled_loop:
       p_entry=&hash
          [((40543*((((p_src[0]<<4)^p_src[1])<<4)^p_src[2]))>>4) & 0xFFF];
       p=*p_entry; *p_entry=p_src; offset=p_src-p;
       if (offset>4095 || p<p_src_first || offset==0 || PS || PS || PS)
         {p_src=*p_entry; literal: *p_dst++=*p_src++; control&=0xFFFEFFFF;}
       else
         {PS || PS || PS || PS || PS || PS || PS || PS ||
          PS || PS || PS || PS || PS || PS || PS || p_src++;
          *p_dst++=((offset&0xF00)>>4)|(--p_src-*p_entry-3);
          *p_dst++=offset&0xFF;}
       control>>=1;
    end_unrolled_loop: if (--unroll) goto begin_unrolled_loop;
    if ((control&TOPWORD)==0)
      {*p_control=control&0xFF; *(p_control+1)=(control>>8)&0xFF;
       p_control=p_dst; p_dst+=2; control=TOPWORD;}
   }
 while (control&TOPWORD) control>>=1;
 *p_control++=control&0xFF; *p_control++=control>>8;
 if (p_control==p_dst) p_dst-=2;
 *p_dst_len=p_dst-p_dst_first;
 return;
 overrun: fast_copy(p_src_first,p_dst_first+FLAG_BYTES,src_len);
          *p_dst_first=FLAG_COPY; *p_dst_len=src_len+FLAG_BYTES;
}


The decompress routine 


Like many LZ77 schemes, the decoder is especially simple and fast. The as- 
signment of bits to the offset and length, and the separation of the control bits 
into a control word minimize the amount of bit-shifting which must be done. 
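
To make the format easier to follow before reading the unrolled listing, here is a deliberately plain decoder for the same compressed format. It is a sketch written for this discussion and not Williams' routine: the name lzrw1a_decode is hypothetical, there is no loop unrolling, and it assumes a well-formed input of at least four bytes and an output buffer that is large enough.

```c
#include <stddef.h>
#include <string.h>

/* Decode src[0..src_len-1] into dst and return the number of bytes written. */
static size_t lzrw1a_decode(const unsigned char *src, size_t src_len,
                            unsigned char *dst)
{
    const unsigned char *p = src + 4;          /* skip the 4-byte copy flag */
    const unsigned char *end = src + src_len;
    unsigned char *out = dst;

    if (src[0] == 1) {                         /* copy flag set: raw data follows */
        memcpy(dst, p, src_len - 4);
        return src_len - 4;
    }
    while (p < end) {
        /* Two control bytes: bit i describes item i of this group. */
        unsigned control = p[0] | ((unsigned)p[1] << 8);
        p += 2;
        for (int i = 0; i < 16 && p < end; i++, control >>= 1) {
            if (control & 1) {                 /* copy item (two bytes) */
                size_t length = (size_t)(p[0] & 0x0F) + 3;
                size_t offset = ((size_t)(p[0] & 0xF0) << 4) | p[1];
                const unsigned char *from = out - offset;
                p += 2;
                /* Byte by byte, since source and target may overlap. */
                while (length--)
                    *out++ = *from++;
            } else {                           /* literal item (one byte) */
                *out++ = *p++;
            }
        }
    }
    return (size_t)(out - dst);
}
```

As a small check, the eleven bytes 00 00 00 00 08 00 'a' 'b' 'c' 03 03 (three literals followed by a copy item with offset 3 and length 6) decode to "abcabcabc".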


LOCAL void compress_decompress(p_src_first,src_len,p_dst_first,p_dst_len)
/* Input  : Specify input block using p_src_first and src_len. */
/* Input  : Point p_dst_first to the start of the output zone. */
/* Input  : Point p_dst_len to a ULONG to receive the output length. */
/* Input  : Input block and output zone must not overlap. User knows */
/* Input  : upperbound on output block length from earlier compression. */
/* Input  : In any case, maximum expansion possible is nine times. */
/* Output : Length of output block written to *p_dst_len. */
/* Output : Output block in Mem[p_dst_first..p_dst_first+*p_dst_len-1]. */
/* Output : Writes only in Mem[p_dst_first..p_dst_first+*p_dst_len-1]. */
UBYTE *p_src_first,*p_dst_first; ULONG src_len,*p_dst_len;
{
 register UBYTE *p_src=p_src_first+FLAG_BYTES,*p_dst=p_dst_first;
 UBYTE *p_src_post=p_src_first+src_len;
 UBYTE *p_src_max16=p_src_first+src_len-(16*2);
 register ULONG control=1;
 if (*p_src_first==FLAG_COPY)
   {fast_copy(p_src_first+FLAG_BYTES,p_dst_first,src_len-FLAG_BYTES);
    *p_dst_len=src_len-FLAG_BYTES; return;}
 while (p_src!=p_src_post)
   {register UWORD unroll;
    if (control==1) {control=0x10000|*p_src++; control|=(*p_src++)<<8;}
    unroll= p_src<=p_src_max16 ? 16 : 1;
    while (unroll--)
      {if (control&1)
         {register UWORD lenmt; register UBYTE *p;
          lenmt=*p_src++; p=p_dst-(((lenmt&0xF0)<<4)|*p_src++);
          *p_dst++=*p++; *p_dst++=*p++; *p_dst++=*p++;
          lenmt&=0xF; while (lenmt--) *p_dst++=*p++;}
       else
          *p_dst++=*p_src++;
       control>>=1;
      }
   }
 *p_dst_len=p_dst-p_dst_first;
}


The compress header 


This is a generic header file used by Williams in implementing several compres- 
sion schemes. 


/******************************************************************************/
/* */
/* COMPRESS.H */
/* */
/******************************************************************************/
/* */
/* Author : Ross Williams. */
/* Date   : December 1989. */
/* */
/* This header file defines the interface to a set of functions called */
/* 'compress', each member of which implements a particular data */
/* compression algorithm. */
/* */
/* Normally in C programming, for each .H file, there is a corresponding */
/* .C file that implements the functions promised in the .H file. */
/* Here, there are many .C files corresponding to this header file. */
/* Each conforming implementation file contains a single function */
/* called 'compress' that implements a single data compression algorithm */
/* that conforms with the interface specified in this header file. */
/* Only one algorithm can be linked in at a time in this organization. */
/* */
/******************************************************************************/
/* */
/* DEFINITION OF FUNCTION COMPRESS */
/* =============================== */
/* */
/* Summary of Function Compress */
/* ---------------------------- */
/* The action that 'compress' takes depends on its first argument called */
/* 'action'. The function provides three actions: */
/* */
/*   - Return information about the algorithm. */
/*   - Compress a block of memory. */
/*   - Decompress a block of memory. */
/* */
/* Parameters */
/* ---------- */
/* See the formal C definition later for a description of the parameters. */
/* */
/* Constants */
/* --------- */
/* COMPRESS_OVERRUN: The constant defines by how many bytes an algorithm */
/* is allowed to expand a block during a compression operation. */
/* */
/* Although compression algorithms usually compress data, there will */
/* always be data that a given compressor will expand. Fortunately, the */
/* degree of expansion can be limited to a single bit, by copying over */
/* the input data if the data gets bigger during compression. To allow */
/* for this possibility, the first bit of a compressed representation can */
/* be used as a flag indicating whether the input data was copied over, */
/* or truly compressed. In practice, the first byte would be used to */
/* store this bit so as to maintain byte alignment. */
/* */
/* Unfortunately, in general, the only way to tell if an algorithm will */
/* expand a particular block of data is to run the algorithm on the data. */
/* If the algorithm does not continuously monitor how many output bytes */
/* it has written, it might write an output block far larger than the */
/* input block before realizing that it has done so. On the other hand, */
/* continuous checks on output length are inefficient. */
/* */
/* To cater for all these problems, this interface definition: */
/* > Allows a compression algorithm to return an output block that is up */
/*   to COMPRESS_OVERRUN bytes longer than the input block. */
/* > Allows a compression algorithm to write up to COMPRESS_OVERRUN bytes */
/*   more than the length of the input block to the memory of the output */
/*   block regardless of the length of the output block eventually */
/*   returned. This allows an algorithm to overrun the length of the */
/*   input block in the output block by up to COMPRESS_OVERRUN bytes */
/*   between expansion checks. */
/* */
/* The problem does not arise for decompression. */
/* */
/* Identity Action */
/* --------------- */
/* > action must be COMPRESS_ACTION_IDENTITY. */
/* > wrk_mem must point to a pointer to struct compress_identity. */
/* > The value of the other parameters does not matter. */
/* > After execution, p=*((struct compress_identity **) wrk_mem) is a */
/*   pointer to a structure of type compress_identity. */
/*   Thus, for example, after the call, p->memory will return the number */
/*   of bytes of working memory that the algorithm requires to run. */
/* > The values of the identity structure returned are fixed constant */
/*   attributes of the algorithm and must not vary from call to call. */
/* */
/* Common Requirements for Compression and Decompression Actions */
/* ------------------------------------------------------------- */
/* > wrk_mem must point to an unused block of memory of a length */
/*   specified in the algorithm's identity block. The identity block can */
/*   be obtained by making a separate call to compress, specifying the */
/*   identity action. */
/* > The INPUT BLOCK is defined to be Memory[src_addr,src_addr+src_len-1]. */
/* > dst_len will be used to denote *p_dst_len. */
/* > dst_len is not read by compress, only written. */
/* > The value of dst_len is defined only upon termination. */
/* > OUTPUT BLOCK is defined to be Memory[dst_addr,dst_addr+dst_len-1]. */
/* */
/* Compression Action */
/* ------------------ */
/* > action must be COMPRESS_ACTION_COMPRESS. */
/* > src_len must be in the range [0,COMPRESS_MAX_ORG]. */
/* > The OUTPUT ZONE is defined to be */
/*   Memory[dst_addr,dst_addr+src_len-1+COMPRESS_OVERRUN]. */
/* > The function can modify any part of the output zone regardless of */
/*   the final length of the output block. */
/* > The input block and the output zone must not overlap. */
/* > dst_len will be in the range [0,src_len+COMPRESS_OVERRUN]. */
/* > dst_len will be in the range [0,COMPRESS_MAX_COM] (from prev fact). */
/* > The output block will consist of a representation of the input block. */
/* */
/* Decompression Action */
/* -------------------- */
/* > action must be COMPRESS_ACTION_DECOMPRESS. */
/* > The input block must be the result of an earlier compression op. */
/* > If the previous fact is true, the following facts must also be true: */
/*   > src_len will be in the range [0,COMPRESS_MAX_COM]. */
/*   > dst_len will be in the range [0,COMPRESS_MAX_ORG]. */
/* > The input and output blocks must not overlap. */
/* > Only the output block is modified. */
/* > Upon termination, the output block will consist of the bytes */
/*   contained in the input block passed to the earlier compression op. */
/* */
/******************************************************************************/


#include "port.h" 


#define COMPRESS_ACTION_IDENTITY   0
#define COMPRESS_ACTION_COMPRESS   1
#define COMPRESS_ACTION_DECOMPRESS 2

#define COMPRESS_OVERRUN 1024
#define COMPRESS_MAX_COM 0x70000000
#define COMPRESS_MAX_ORG (COMPRESS_MAX_COM-COMPRESS_OVERRUN)

#define COMPRESS_MAX_STRLEN 255

/* The following structure provides information about the algorithm. */
/* > The top bit of id must be zero. The remaining bits must be chosen */
/*   by the author of the algorithm by tossing a coin 31 times. */
/* > The amount of memory requested by the algorithm is specified in */
/*   bytes and must be in the range [0,0x70000000]. */
/* > All strings s must be such that strlen(s)<=COMPRESS_MAX_STRLEN. */
struct compress_identity
  {
   ULONG id;           /* Identifying number of algorithm. */
   ULONG memory;       /* Number of bytes of working memory required. */
   char *name;         /* Name of algorithm. */
   char *version;      /* Version number. */
   char *date;         /* Date of release of this version. */
   char *copyright;    /* Copyright message. */
   char *author;       /* Author of algorithm. */
   char *affiliation;  /* Affiliation of author. */
   char *vendor;       /* Where the algorithm can be obtained. */
  };

void compress(      /* Single function interface to compression algorithm. */
UWORD action,       /* Action to be performed. */
void *wrk_mem,      /* Working memory temporarily given to routine to use. */
                    /* If action=..IDENTITY => Adr of id structure. */
UBYTE *src_adr,     /* Address of input data. */
ULONG src_len,      /* Length of input data. */
UBYTE *dst_adr,     /* Address of output data. */
ULONG *p_dst_len    /* Pointer to a longword where routine will write: */
                    /* If action=..COMPRESS => Length of output data. */
                    /* If action=..DECOMPRESS => Length of output data. */
);


The port header 


In the original version, a fast copy routine specific to the 68000 was used. Ap- 
propriate definitions for other platforms have been placed in this header file. 


/******************************************************************************/
/* */
/* PORT.H */
/* */
/******************************************************************************/
/* */
/* This module contains macro definitions and types that are likely to */
/* change between computers. */
/* */
/******************************************************************************/

#ifndef DONE_PORT            /* Only do this if not previously done. */

#define UBYTE unsigned char  /* Unsigned byte */
#define UWORD unsigned int   /* Unsigned word (2 bytes) */
#define ULONG unsigned long  /* Unsigned word (4 bytes) */
#define LOCAL static         /* For non-exported routines. */
#define EXPORT               /* Signals exported function. */

#ifndef TRUE
# define TRUE 1
#endif

#ifdef HAVE_FAST_COPY_H
# include "fast_copy.h"
#else
# ifdef USE_BCOPY
#  define fast_copy bcopy
# else
#  include <string.h>
#  define fast_copy(src, dst, len) memcpy(dst, src, len)
# endif
#endif

#define DONE_PORT            /* Don't do all this again. */
#endif


Testing 


A very simple program using two copies of the “She sells...” fragment as the 
src test string will illustrate the process. The calls may look like: 


/* Retrieve a pointer to the compress_identity structure. */
struct compress_identity *p;
compress(COMPRESS_ACTION_IDENTITY, &p, NULL, 0, NULL, NULL);

/* Allocate p->memory bytes for wrk_mem, etc. Then call the
 * compress routine.
 */
compress(COMPRESS_ACTION_COMPRESS, wrk_mem, src, src_len, dst, &dst_len);

/* The bytes in dst should be examined. The source can be recovered
 * with the following call.
 */
compress(COMPRESS_ACTION_DECOMPRESS, wrk_mem, dst, dst_len, src, &src_len);


After dst is filled by the second call to compress, it is illustrative to print the 
bytes and verify the contents. Also, the process might be repeated using a single 
copy of the “She sells...” example. After the third call to compress, the original 
string and length should be recovered in src and src_len, respectively. 

Resources, Patents, and Illusions 


This appendix is divided into three sections. The first contains information on 
finding software and other resources. The second introduces some of the patent 
issues which affect developers and users of compression algorithms. The final 
section is presented as somewhat of an amusement, and shows what happens 
when lack of basic mathematical reasoning is combined with advertising. It 
is centered on what is known as the “WEB Compressor,” but it is a story that 
seems to be repeated regularly. 




C.1 Resources 


Vast amounts of information and source code may be found via computer using 
standard retrieval methods. This section lists some links which the authors have 
found useful, along with the site for material for this book. 


Documentation and scripts for this book Material directly related to this 
book is maintained on 


http://www.dms.auburn.edu/compression 


which is also visible by anonymous ftp under pub/compression. The 
documentation and scripts discussed in Appendix A appear in the ‘jpegtool’ 
subdirectory. 


Frequently Asked Questions The FAQ (maintained by Jean-loup Gailly) for 
the newsgroups comp.compression and comp.compression.research is a 
good source for introductory material, pointers to source code, references, 
and other information. It is posted regularly on the newsgroups, and may 
also be obtained via http://www.faqs.org in compression-faq. The site 
contains FAQs for many newsgroups. 


Arithmetic coding The implementation from [84] may be found via 


ftp://ftp.cpsc.ucalgary.ca/projects/ar.cod/ 
A separate implementation of the same coding scheme can be found in
the book by Nelson and Gailly [53]. The code from [48] can be found on
Moffat’s page http://www.cs.mu.oz.au/~alistair.


Barnsley and Hurd [5] present arithmetic coding in the language of frac- 
tals. 


Fractal image encoding Yuval Fisher’s page contains a wealth of information 
and pointers to fractal methods: 


http://inls.ucsd.edu/Research/Fisher/Fractals 


His book [20] contains a nicely-done introduction to the topic. The Wa- 
terloo Fractal Compression Project at http://links.uwaterloo.ca is another 
large page. Included is a pointer to the “Waterloo BragZone” which in- 
troduces a test suite and includes test results from various coders. 


The second edition of [53] includes a chapter on fractal compression (by 
Jean-loup Gailly). On-line information is available through Nelson’s page 
and http://www.teaser.fr/~jlgailly/.


The GNU Project and the Free Software Foundation started in 1984 to de- 
velop a complete free Unix-like operating system. A number of software 
packages (including Octave) used by the authors of this book are released 
under the GNU General Public License. Information about the Project 
and the FSF is available through http://www.fsf.org. 


Info-ZIP This group supports the Zip and UnZip programs, widely used on 
many platforms. Their page contains pointers to source code and doc- 
umentation, and information about the authors: http://www.info-zip.org. 
The zlib compression library uses the same algorithm as Zip and gzip, 
and the documentation may be of interest: http://www.gzip.org/zlib/. 


JPEG The images in Figure 10.13 were generated with release 6 software from 
the Independent JPEG Group (IJG). Their software, along with a revised 
version of [77], errata for the first printing of [57], and other information 
is available via ftp://ftp.uu.net/graphics/jpeg. 


League for Programming Freedom The LPF is an organization that opposes 
software patents and user-interface copyrights (but is not opposed to 
copyright on individual programs), http://lpf.ai.mit.edu/.


Mark Nelson maintains a collection of his articles, source code, information 
on books (such as [53]), and other notes via http://marknelson.us. 


US Patent and Trademark Office A searchable database of patent informa- 
tion is available via http://www.uspto.gov. 
Portable Network Graphics PNG is designed as a GIF successor, and uses
the LZ77-variant found in gzip: http://www.libpng.org. A short history 
of PNG may be found in [60]. 


Wavelets The Wavelet Digest at http://www.wavelet.org may be a good starting
point. Colm Mulcahy’s Mathematics Magazine article [52] contains an 
elementary introduction. A number of his papers, along with Matlab code 
and images, are available from http://www.spelman.edu/colm. 


The images in Figures 10.18 and 10.19 were generated with Geoff 
Davis’ Wavelet Image Compression Construction Kit, available through 
http://www.cs.dartmouth.edu/~gdavis.


Ross Williams opened “Dr Ross’s Compression Crypt” as the first edition of 
this book was going to press. http://www.ross.net/compression/ contains 
notes and sources for his work on various compression-related topics, 
including the LZRW family of algorithms. 




C.2 Data compression and patents 


The area of patents is a minefield for those interested in data compression. In 
testimony prepared by the LPF for the 1994 Patent Office Hearings, Gordon 
Irlam and Ross Williams write: 


As a result of software patents, many areas of software development are 
simply becoming out of bounds. A good example is the field of text data 
compression. There are now so many patents in this field that it is virtually 
impossible to create a data compression algorithm that does not infringe 
at least one of the patents. It is possible that such a patent-free algorithm 
exists, but it would take a team of patent attorneys weeks to establish this 
fact, and in the end, any of the relevant patent holders would be able to 
launch a crippling unfair lawsuit anyway.1


Companies such as Oracle, Adobe, and Autodesk presented testimony against 
software patents; on the other side were companies such as IBM, Intel, Mi- 
crosoft, and SGI. There were middle-ground positions: Sun testified that “the 
[patent] system is indeed broken and needs addressing,” but did not call for 
elimination. 

Donald E. Knuth, in a letter to the patent office, writes 


In the period 1945-1980, it was generally believed that patent law did 
not pertain to software. However, it now appears that some people have 
received patents for algorithms of practical importance—e.g., Lempel- 
Ziv compression and RSA public key encryption—and are now legally 


1From “Software Patents: An Industry at Risk” by Gordon Irlam and Ross Williams.
preventing other programmers from using these algorithms...If software
patents had been commonplace in 1980, I would not have been able to create
[TeX], nor would I probably ever have thought of doing it, nor can I
imagine anyone else doing so...The basic algorithmic ideas that people are 
now rushing to patent are so fundamental, the result threatens to be like 
what would happen if we allowed authors to have patents on individual 
words and concepts...There are far better ways to protect the intellectual 
property rights of software developers than to take away their right to use 
fundamental building blocks.2 


Perhaps the best known patent problem (other than the infamous exclusive-or
patent)3 concerns an LZ78-type scheme (Lempel-Ziv-Welch). The scheme
is widely used, but the general internet user probably only heard of the patent
problem when Unisys pressed for royalties in late 1994 in connection with the
GIF graphics format:4


The LZW algorithm used in compress is patented by IBM and Unisys. It 
is also used in the V.42bis compression standard, in Postscript Level 2, 
in GIF and TIFF. Unisys sells the license to modem manufacturers for a 
onetime fee. CompuServe is licensing the usage of LZW in GIF products 
for 1.5% of the product price, of which 1% goes to Unisys; usage of LZW 
in non-GIF products must be licensed directly from Unisys. 


And, as an example of the patent mess, 


The IBM patent application was first filed three weeks before that of Unisys, 
but the US patent office failed to recognize that they covered the same
algorithm. (The IBM patent is more general, but its claim 7 is exactly LZW.)5


To be precise, the patent office maintains that algorithms are not patentable, 
but an algorithm used to solve some particular problem is considered patentable. 
Irlam and Williams write: “Thus the ‘RSA algorithm’ is not patentable, but ‘use
of the RSA algorithm to encrypt data’ is patentable...For all practical purposes,
such patents can be considered patents on algorithms.” 

The Stac—Microsoft lawsuit involved an LZ77-type scheme: 


Waterworth patented6 the algorithm now known as LZRW1 (the “RW” is
because Ross Williams reinvented it later and posted it on comp.compres- 
sion on April 22, 1991). The same algorithm has later been patented by 


2Reported in Programming Freedom, the Newsletter of the League for Programming Freedom, 
February 1995. 

3 4,197,590 Method for dynamically viewing image elements stored in a random access memory
array, filed Jan 19, 1978, granted Apr 8, 1980. Cadtrack has collected large sums of money and suc- 
cessfully defended this patent which includes claims of “XOR feature permits part of the drawing 
to be moved or ‘dragged’ into place without erasing other parts of the drawing.” 

4 A short note on the Unisys action and an introduction to software patent issues can be found in
the March 1995 issue of Scientific American [10]. A new graphics specification, Portable Network
Graphics (PNG or “ping”), was developed partly in response to the Unisys action. PNG is a lossless
scheme with more capabilities than GIF.

5The patents are 4,814,746 (IBM) and 4,558,302 (Unisys). Much of this patent information 
comes from the FAQ maintained by Jean-loup Gailly, and from the LPF. 

6 4,701,745 Data compression system, filed Mar 3, 1986, granted Oct 20, 1987.
Gibson & Graybill.7 The patent office failed to recognize that the same
algorithm was patented twice, even though the wording used in the two 
patents is very similar. 


The Waterworth patent is now owned by Stac Inc., which won a lawsuit 
against Microsoft, concerning the compression feature of MSDOS 6.0. 
Damages awarded were $120 million. (Microsoft and Stac later settled 
out of court.) 


The Gibson & Graybill patent is very general and could be interpreted 
as applying to any LZ algorithm using hashing (including all variants of 
LZ78). However, the text of the patent and the other claims make clear 
that the patent should cover the LZRW1 algorithm only. (In any case the 
Gibson & Graybill patent is likely to be invalid because of the prior art in 
the Waterworth patent.) 


The LZRW1 scheme was presented by Williams in [83]. The original GNU 
zip (gzip) was to have used LZRW1. Patents on arithmetic coding affect the 
graphics compression scheme known as JPEG: 


IBM holds many patents on arithmetic coding.8 It has patented in par- 
ticular the Q-coder implementation of arithmetic coding. The arithmetic 
coding option of the JPEG standard requires use of the patented algorithm. 
No JPEG-compatible method is possible without infringing the patent, be- 
cause what IBM actually claims rights to is the underlying probability 
model (the heart of an arithmetic coder). 


From the documents in the Independent JPEG Group’s source distribution:


It appears that the arithmetic coding option of the JPEG spec is covered 
by patents owned by IBM, AT&T, and Mitsubishi...For this reason, sup- 
port for arithmetic coding has been removed from the free JPEG software. 
(Since arithmetic coding provides only a marginal gain over the unpatented 
Huffman mode, it is unlikely that very many implementations will support 
it.) 


More information and references (on both sides of the patent issue) can be 
found in the LPF materials. 


7 5,049,881 Apparatus and method for very high data rate-compression incorporating lossless
data compression and expansion utilizing a hashing technique, filed Jun 18, 1990, granted Sep 17, 
1991. 

8Here’s a few from the FAQ: 4,286,256 Method and means for arithmetic coding using a reduced 
number of operations, granted Aug 25, 1981. 

4,463,342 A method and means for carry-over control in a high order to low order combining of 
digits of a decodable set of relatively shifted finite number strings, granted Jul 31, 1984. 

4,467,317 High-speed arithmetic compression using concurrent value updating, granted Aug 21, 
1984. 

4,652,856 A multiplication-free multi-alphabet arithmetic code, granted Feb 4, 1986. 

4,935,882 Probability adaptation for arithmetic coders, granted Jun 19, 1990. 
C.3 Illusions


Webster defines an illusion as a “mistaken idea,” and the field of data compres- 
sion has had a few amusing cases. The specific example given here concerns 
the “WEB compressor,” but the claims are perhaps the classic ones presented 
by those who haven’t done their homework. 

The FAQ has additional material on this topic. Concerning the WEB fiasco, 
Jean-loup Gailly writes: 


Such algorithms are claimed to be applicable recursively, that is, apply- 
ing the compressor to the compressed output of the previous run, possibly 
multiple times. Fantastic compression ratios of over 100:1 on random data 
are claimed to be actually obtained. 


Such claims inevitably generate a lot of activity on comp.compression, 
which can last for several months. The two largest bursts of activity were 
generated by WEB Technologies and by Jules Gilbert. Premier Research 
Corporation (with a compressor called MINC) made only a brief appear- 
ance. 


Other people have also claimed incredible compression ratios, but the pro- 
grams (OWS, WIC) were quickly shown to be fake (not compressing at 
all). 


The story 


The claims made by WEB Technologies are not unique, and certainly illustrate 
that sometimes not even common-sense analysis is performed. According to 
BYTE:9


In an announcement that has generated quite a bit of interest, and more 
than a healthy dose of skepticism, WEB Technologies (Smyrna, GA) says 
it has developed a utility that will compress files larger than 64KB to about 
one-sixteenth their original size. Furthermore, WEB says its DataFiles/16 
program can compress files that the program has already compressed. 


We might be willing to play along at this stage. Perhaps the announcement 
exaggerates a little, and the company meant to say that it can achieve very good 
compression on a large class of files. The last sentence is also cause for concern, 
although it is not proof that the claims are completely bogus. However, the 
article goes on to say: 


In fact, according to the company, virtually any amount of data can be 
compressed to under 1024 bytes by using DataFiles/16 to compress its 
own output files multiple times. 


9 “Instant Gigabytes?”, BYTE Magazine 17(6):45, June 1992. © by The McGraw-Hill
Companies, Inc. All rights reserved. Used by permission.
According to the FAQ, the company’s promotional materials clearly indicate
that this is a lossless scheme:


DataFiles/16 will compress all types of binary files to approximately one- 
sixteenth of their original size...regardless of the type of file (word pro- 
cessing document, spreadsheet file, image file, executable file, etc.), no 
data will be lost by DataFiles/16 [for files of at least 64K]. 


Performed on a 386/25 machine, the program can complete a compres- 
sion/decompression cycle on one megabyte of data in less than thirty sec- 
onds. 


The compressed output file created by DataFiles/16 can be used as the in- 
put file to subsequent executions of the program. This feature of the utility 
is known as recursive or iterative compression, and will enable you to com- 
press your data files to a tiny fraction of the original size. In fact, virtually 
any amount of computer data can be compressed to under 1024 bytes us- 
ing DataFiles/16 to compress its own output files multiple times. Then, by 
repeating in reverse the steps taken to perform the recursive compression, 
all original data can be decompressed to its original form without the loss 
of a single bit. 


The report in the FAQ goes on to say “Decompression is done by using only the 
data in the compressed file; there are no hidden or extra files.” 

The company apparently failed to make even the most basic mathemati- 
cal analysis. A simple counting argument (such as that contained in the next 
section) would have convinced them to review their statements. 


The counting argument10


The WEB compressor was claimed to compress without loss all files of greater 
than 64KB in size to about 1/16th their original length. A very simple counting
argument shows that this is impossible, regardless of the compression method. 
It is even impossible to guarantee lossless compression of all files by at least 1 
bit. (Many other proofs have been posted on comp.compression, please do not 
post yet another one.) 

Assume that the program can compress without loss all files of size at least
N bits. Compress with this program all the $2^N$ files which have exactly N
bits. All compressed files have at most N − 1 bits, so there are at most $2^N - 1$
different compressed files ($2^{N-1}$ files of size N − 1, $2^{N-2}$ of size N − 2, and so
on, down to 1 file of size 0). So at least two different input files must compress to
the same output file. Hence the compression program cannot be lossless. (Much
stronger results about the number of incompressible files can be obtained, but
the proofs are a little more complex.)

This argument applies of course to WEB’s case (take N = 64K × 8 bits).
Note that no assumption is made about the compression algorithm. The proof 


10 Contributed by Jean-loup Gailly. Used by permission. 
applies to any algorithm, including those using an external dictionary, or
repeated application of another algorithm, or combination of different algorithms,
or representation of the data as formulas, etc. All schemes are subject to the 
counting argument. There is no need to use information theory to provide a 
proof, just basic mathematics. 

This assumes, of course, that the information available to the decompressor 
is only the bit sequence of the compressed data. If external information such as 
a file name, a number of iterations, or a bit length is necessary to decompress 
the data, the bits necessary to provide the extra information must be included 
in the bit count of the compressed data. Otherwise, it would be sufficient to 
consider any input data as a number, use this as the file name, iteration count 
or bit length, and pretend that the compressed size is zero. For an example 
of storing information in the file name, see the program ‘lmfjyh’ in the 1993
International Obfuscated C Code Contest, available on all comp.sources.misc 
archives (Volume 39, Issue 104). 

A common flaw in the algorithms claimed to compress all files is to assume 
that arbitrary bit strings can be sent to the decompressor without actually trans- 
mitting their bit length. If the decompressor needs such bit lengths to decode 
the data (when the bit strings do not form a prefix code), the number of bits 
needed to encode those lengths must be taken into account in the total size of 
the compressed data. 


Conclusion 


To get a more complete story, we recommend reading the BYTE article and the 
FAQ. The folks at BYTE were clearly skeptical, and reported: 


[A beta-test version] did create archive files that were compressed to the 
degree that the company claimed. The beta version decompressed these 
files into their original names and sizes, but, unfortunately, the contents of 
the decompressed files bore little resemblance to that of the original files. 


The FAQ reported: 


[WEB] now says that they have put off releasing a software version of 
the algorithm because they are close to signing a major contract with a 
big company to put the algorithm in silicon. He said he could not name the 
company due to non-disclosure agreements, but that they had run extensive 
independent tests of their own and verified that the algorithm works. 


He said the algorithm is so simple that he doesn’t want anybody getting 
their hands on it and copying it even though he said they have filed a patent 
on it. [He] said the silicon version would hold up much better to patent 
enforcement and be harder to copy. 


He claimed that the algorithm takes up about 4K of code, uses only integer 
math, and the current software implementation only uses a 65K buffer. He 
said the silicon version would likely use a parallel version and work in 
real-time. 
Our favorite statement from the BYTE article is:


According to... WEB Technologies’ vice president of sales and marketing, 
the compression algorithm used by DataFiles/16 is not subject to the laws 
of information theory. 


Also, “The company’s spokespersons have declined to discuss the nature of 
the algorithm” and, of course, there was no product. BYTE did a followup, 
reporting: 


WEB said it would send us a version of the program that worked, but we 
never received it. 

When we attempted to follow up on the story about three months later, 
the company’s phone had been disconnected. Attempts to reach company 
officers were also unsuccessful. WEB appears to have compressed itself 
right off the computing radar screen.11


Concerning stories such as the WEB compressor, Gailly adds: “similar 
affairs tend to come up regularly on comp.compression. The advertised revolu- 
tionary methods have all in common their supposed ability to compress signif- 
icantly random or already compressed data. I will keep this item in the FAQ to 
encourage people to take such claims with great precautions.” 

The US patent office apparently doesn’t read the FAQ. In July 1996, they 
granted a patent (5,533,051) on a “Method for Data Compression” that repeats 
several of the mathematically impossible claims discussed in the WEB story. 
Gailly has an analysis on his page (see Section C.1 or the FAQ). 


11 “Whatever Happened To...WEB Technologies’ Amazing Compression?” BYTE Magazine
20(11):48, November 1995. © by The McGraw-Hill Companies, Inc. All rights reserved. Used by
permission.
Notes on and Solutions to Some
Exercises 


1.1 Introduction 


2. 15 


1.2 Events 


1. 30% 

2. No. Let a = P(HH), b = P(HT), c = P(TH), and d = P(TT). The considerations
posed are: b = c and a + b = a + c = 1/2. In addition, a, b, c, d ≥ 0
and a + b + c + d = 1. Do these requirements determine a, b, c, and d? No: for
instance, a = b = c = d = 1/4 and a = d = 1/8, b = c = 3/8 are two different
probability assignments satisfying the requirements.

Notice that even requiring, in addition, that a = d and that b + d = c + d = 1/2
will not suffice to determine a, b, c, and d.


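3. 65%. With S standing for the population and A, B, C having the obvious meanings,

$$
\begin{aligned}
P(S \setminus (A \cup B \cup C)) &= 1 - P((A \cup B) \cup C) \\
&= 1 - [P(A \cup B) + P(C) - P((A \cup B) \cap C)] \\
&= 1 - [P(A) + P(B) + P(C) - P(A \cap B) - P((A \cap C) \cup (B \cap C))] \\
&= 1 - [P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)] \\
&= 1 - \left[\tfrac{30}{100} + \tfrac{10}{100} + \tfrac{12}{100} - \tfrac{8}{100} - \tfrac{7}{100} - \tfrac{4}{100} + \tfrac{2}{100}\right] = \tfrac{65}{100}
\end{aligned}
$$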


1.3 Conditional probability 
1 113, 3-11 11-10 _ 36 


; ae ae a 


2. $1 - \dfrac{11 \cdot 10 \cdot 9}{14 \cdot 13 \cdot 12} = \dfrac{199}{364}$


3. 1-4 = i 


99,999 1 249,999 
4. (a) 700.000 000 + 790,000 100,000 150,000 — 15,000,000,000 


99,999 1 
(b) 00,000 T50,000 99,999 
(24 


9,999)/15 billion — 249,999 
5. P(A | green) = Tas = a , P(red) = 35 + : + i= = a. proportion 
of red balls = 1731 


1.4 Independence 


3. S, ∅, {a, c}, {a, d}, {b, c}, {b, d}
4. S, ∅


5. Urn C will contain three red and five green balls. [Let x be the number of
red and y the number of green balls in urn C. The independence requirement


y y Wan ase Vesoc 95 
translates into the equation ; we 334 +1+ rey Solve for ep iy Ss 
The positive integers x, y satisfying ne equation with x + y smallest are x = 3, 
y=5.] 


1.5 Bernoulli trials 


4 fa 
LOBE O1-1+8O) OL WA 


k=0 
8) 1 7 e 8) 1 219 1 
2. (a) (3) 35 = 39 (D)1— Vi ()ax = 356 O£> (= 356 
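3. $\dfrac{P(n+1 \text{ heads in } 2n+2 \text{ flips})}{P(n \text{ heads in } 2n \text{ flips})} = \dfrac{\frac{1}{2^{2n+2}} \cdot \frac{(2n+2)!}{((n+1)!)^2}}{\frac{1}{2^{2n}} \cdot \frac{(2n)!}{(n!)^2}} = \dfrac{2n+1}{2n+2} < 1$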


1.6 An elementary counting principle 
15) (8 
3. (s)() 

1.7 On drawing without replacement 


2. (a) OC) — (b) 443442 COG@)G — _56_ 


ae =25 G ) ~ 2185 


1.8 Random variables and expected, or average, value 


42% wo? 

5. The probability of an atom not decaying in 24 hours (four successive 6-hour
periods) is $(1-p)^4$. Therefore, the probability of an atom decaying in 24 hours
is $1-(1-p)^4$. Let n be the number of undecayed atoms present at the beginning
of the 24-hour period: n(1 — (1 — p)*) = 4 implies p = 1—(3)!/*. 
2.1 How is information quantified?


1. (a) I(Ea, Fe) = 37g log 


(b) I( Fe \Fa) = loe-7 
(c) [(Ea|Fe) = log 8(8 +443) 


2.2 Systems of events and mutual information 


1. Sz and $4; $3 and S4 
ait 1/36 1/36 1/36 1/36 
3. IE, F) = 36 [log (1/6) /36) + log (1/6)(2/36) +log 1/6)G/36) +log 1176)(4/36) + 
1/36 1/36 /36 1/36 
los TAG RH t 8 TH ERw t 8 THaRH t+ + los TEE RH + 
1/36 1/36 1/36 ae 
log aja +++ los ashore ++ +108 aise] = 
4 [24log3 + 10log2 — 10log5] 


4. ME,F) = 4 (1/3)(11/18) ieee (1/3)7/18) 
ee ee SI: RETA ED + 18 °8 1/3) /3)qqt et) 
(1/3)(4/5) in (1/3)(1/5) 

Wane 3s °8 G7d/3)\GR € 
(1/3)(1/4) (1/3)(3/4) 

(1/3)(0/3) +244) (1/3) 3/4) Rg 


(If simplification is desired, hire a small child.) 


a 


log 


wl 
ESTes) 
~ 


3 
I 


4 
5 
1 
34 log 


+ zFlog 


wl 
ESTes) 
VY 


7. $I(\mathcal{E}, \mathcal{E}) = 0$ if and only if $\mathcal{E}$ contains one event of probability one, with the other
events, if any, in $\mathcal{E}$ necessarily of zero probability.


2.3 Entropy 


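2. $H(\mathcal{E}) = -\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \log\left[\binom{n}{k} p^k (1-p)^{n-k}\right]$

$H(S) = -\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \log\left[p^k (1-p)^{n-k}\right]$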
k=0 


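3. $I(\mathcal{E}, \mathcal{E}) = \sum_i \sum_j P(E_i \cap E_j) \log \dfrac{P(E_i \cap E_j)}{P(E_i) P(E_j)}$
$= \sum_i P(E_i \cap E_i) \log \dfrac{P(E_i \cap E_i)}{P(E_i)^2}$ (since $i \neq j \Rightarrow P(E_i \cap E_j) = 0$)
$= \sum_i P(E_i) \log \dfrac{1}{P(E_i)} = H(\mathcal{E})$.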


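5. By Theorem 2.4.4, $H(\mathcal{E}|\mathcal{F}) = H(\mathcal{E}) - I(\mathcal{E}, \mathcal{F}) = H(\mathcal{E}) \Leftrightarrow I(\mathcal{E}, \mathcal{F}) = 0 \Leftrightarrow \mathcal{E}$
and $\mathcal{F}$ are statistically independent (Theorem 2.2.13).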
2.4 Information and entropy


5. Necessary and sufficient: both $\mathcal{E}$ and $\mathcal{F}$ are the trivial sort of system with exactly
one event of probability 1 and the others of probability zero.

6. $H(\mathcal{E} \wedge \mathcal{E}) = H(\mathcal{E})$, $H(\mathcal{E}|\mathcal{E}) = 0$

7. H(E) =\log3, 
HP) =U + 4+ Alot G++ At td+ Host G+ t+ Hh 
I(E,F) = zz log r erCraeE 4 7 log 7 sree 4 Fig log apt 
+43 log+ 1/3)(5/8)" +42 log -Upa/s) +4 flog (1/3)(6/13) 


T1(3424 6) 1134348 T1(3424 8)’ 


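$H(\mathcal{E} \wedge \mathcal{F}) = H(\mathcal{E}) + H(\mathcal{F}) - I(\mathcal{E}, \mathcal{F})$, $H(\mathcal{E}|\mathcal{F}) = H(\mathcal{E}) - I(\mathcal{E}, \mathcal{F})$,
$H(\mathcal{F}|\mathcal{E}) = H(\mathcal{F}) - I(\mathcal{E}, \mathcal{F})$.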


8. Necessary and sufficient: $\mathcal{E}$ is an amalgamation of $\mathcal{F}$.


9. (b) Put all the balls of one color in B, and all the balls of the other color in C;
$I(\mathcal{E}, \mathcal{F}) = H(\mathcal{E})$.
(c) Make the proportions of red and green balls in B and C the same. (In this
case, it will be necessary to make the numbers of red and green balls equal in
each urn.) $I(\mathcal{E}, \mathcal{F}) = 0$.


3.1 Discrete memoryless channels 


2. (a) p’>q(i—p)U—q)  (b) pq? — p)? 
(c) 3p*g(1— pp) — 4) +3pq°(1 — p)? +3p"q(1 — p)? + p2(1— p)(1—@) 
(d) Forn = 1, the answer is 1. Forn > 2, the answer is pep "A —p)t+t(n— 
1) p"-2g(1 — p) + (n—2)p"-3q(1 — p)(1 —q) + ("3° p"4G7 = py + 
(n—2)p"3q(1— py? + p” 2(1— p)U—4q). 
3. (a) p(—p)? (b) p> p)? 
() 3)p3—p)? dd) p" + np"! — p)+ (6) p"-2. = py? 


3.2 Transition probabilities and binary symmetric channels 


_ {900 901 ox}  |}P q 7 2) n—key _ yk 
. @o=|™ qu Pale P i (b) (i) p" “A p) 


ptr/2 qtr/2], : qtr]. 
(c) ? +r/2 pt+r/2 ; yes, a BSC. (d) ptr ; not a BSC unless r = 


2. @) p= 95)" “Oy p?+1sphd=p)= 85 Cy p= 
3. (a) popi tl — pi (b) p} — po) — p1) 


(c) pap? 2 +-zp2-! p? = (1 — po) + (n—z) pep” = 1 — pi) t+z(n—z) pe | x 
pr! (1 po)(1— pi) + (8) pe pt (1 — po)? + ("5") ppt <= pi)? 


© 2003 by CRC Press LLC 
= {0, 1}°. Transition probabilities: go00,000 = 111,111 = p; 000,111 = 111,000 
d- py: the other six are either pl — p)or pAi- BY. 


1 0 


B 


. (a) In 3.2.1 (c), B={0,1},U=| 0 1 |. In3.2.1() B=({0,1},U = 


1/2 1/2 


oor 
Re Re © 


3.3 Input frequencies 


nA BR WO Re 


. $p_0 = 24/53$, $p_1 = 29/53$

. Output frequency of 0: $(1 + p)/3$; output frequency of 1: $(2 - p)/3$.

. $P(b_1) = .384$, $P(b_2) = .485$, $P(b_3) = .131$

. (a) Average cost = $.38p_1 + .22p_2 + .20p_3$. When $p_1 = .4$, $p_2 = .5$, $p_3 = .1$, the
average cost is .282. (b) Set $p_3 = 1$, $p_1 = p_2 = 0$ to minimize cost; an unwise
choice, however, since it renders the channel useless.


3.4 Channel capacity 


3. 


Si Pe TP = P = I~p 
oe E -a 4 | MA, B)= polPlo8 pope da + 1 — P)108 appa 
5 + glog 


+ pil(l — 4) log ] 
Capacity equations: po + pi = | 
fee ee = lap = 
Plog Pop+pid— ine. p)log pod—p)+piq 
= eS ed =: ae 
(1 — 4) log sor pid=a + 9108 aod= pte = 


q 
Pol—p)+pi9g 


. The capacity equations are 


Pitp2+p3=1 


= 94 04 02 
C=.94log Sap FULTS: + .04log O4peospse Ops + .02log Wp tue oIpy ops Dips 


C=.01log Vp FOI p2F03P3 + .93 log Vapi + 93p+ Ap + .06log a; ops 3p3 
C=.03 log aa FGipst03p; + 04108 Dap Usp aE + .93log ee 06p2+93p3 


. Po = pi = 1/2 are optimal, and the capacity is plog=e at qlog —— we. 


. For 0 < p < 1 (with 0° = 1 in case p = 1), the ieee input frequencies are 


=. 1—(1—p)!/P (1—p)G-P)/P 
Pa = 1+p(l—p)U-P)/P? Pb = 1+-p(—p)"-P)/P? 


p)-P)/P), When p = 0, the capacity is zero, and any relative input frequencies 
are “optimal.” 


and the capacity is log(1 + pd — 


. @) Pa = (2pP— pp)? +1)"|, po = pe = (1— pa)/2, C = log(2p? (1 — 


p)'-P +1). 
. Capacity = p*log a 




(b) Pa = pp = 1/2, C = log?2. 
(c) No; and there is equality when and only when p = 1/2. Explanation left to 
you. 


. Pl=r+ = Pn=1/n, C=logn. 


For 1 <j <n—1; pj = eS and pn = pe C = login 1 +0). 


ap3 2(1—p)3 
Stop + (— py log GPs + 3p(1 — pl plog2p + 


(1 — p)log2(1 — p)]. 


. $p_1 = p_2 = p_3 = 1/3$, C = 2/3 log 2
. Pi=p3=1/2, p»x=0,C= 5 log3 — F log2 


[It is somewhat shocking that one of the optimal input frequencies is zero. We 
are indebted to Luc Teirlinck for this example.] 


4.2 Prefix-condition codes and the Kraft-McMillan inequality 


1. 


(a) $\ell = 5$ (b) $\ell = 3$ (c) n = 3 (d) m = 64


4.3 Average code word length and Huffman’s algorithm 


1. 


(a) There are various correct answers arising from choices made in running
through Huffman’s algorithm, but the unique sequence of code word lengths
is 2,2,3,3,3,4,4. One correct answer: e → 01, a → 10, d → 001, b →
110, f → 111, g → 0000, c → 0001.

(b) One correct answer: e → 1, a → *0, d → *1, b → **, f → 00, g → 01,
c → 0*.


4.4 Optimizing the input frequencies 


1. 


The answers given are not unique, and in (c) and (d), they are debatable. 


(a) e → 01, a → 10, d → 00, b → 110, c → 111.
(b) e → 00, a → 01, d → 11, b → 100, c → 101.
(c) e → 0, a → 1, d → *1, b → *0, c → **.
(d) e → 0, a → *, d → 10, b → 11, c → 1*.


2. Again, the following are not unique, 


(a) e → 001, a → 110, d → 101, b → 011, c → 000.
(b) e → 000, a → 001, d → 010, b → 111, c → 100.
(c) e → 01*, a → 0*1, d → *01, b → *10, c → 10.
(d) e → 01, a → 00, d → 11, b → **, c → 1*.
4.5 Error correction and reliability


1. (a) @) Receivew: O00 O1 Ox 10 11 Ix «0 xl xx 
Decodes: ac aoa ob b a eb ea 


(ii) R=.79194 (iii) = .312 


4.6 Shannon’s Noisy Channel Theorem 


1. $C = .95\log_2 1.9 + (.05)\log_2(.1) \approx .7136$; $H = (.5)\log_2 2 + .3\log_2(.3)^{-1} +
.2\log_2 5 \approx 1.4855$; $\rho = 100$, if the unit of time is one second. Therefore, the
upper limit on the number of source letters per second that the channel can handle,
with vanishingly small error probability, is $\rho C/H \approx 48.0385$.


5.1 Replacement via encoding scheme 


4. 5/3 [With $s_1 = 000, \ldots, s_8 = 111$, the original file parses into the source text
$s_8 s_7 s_8 s_8 s_6 s_7 s_8 s_6 s_7 s_7$, which is encoded 010001101001101010, 18 bits compared
to 30 in the original file.]


5.2 Review of the prefix condition 


1. (a) Add 11; (b) add 010 and 111; (c) add 1111. 


5.3 Choosing an encoding scheme 


1. (a) 2/1.95  (b) 2.7/2.55 
2. (a) Shannon: 2/2.2 (less than one!), Fano: 2/1.95 
(b) Shannon: 2.7/2.95 (again!), Fano: 2.7/2.55 


5.4 The Noiseless Coding Theorem and Shannon’s bound 


1. (a) $H = .4\log_2(.4)^{-1} + 2(.25\log_2(.25)^{-1}) + .1\log_2 10 \approx 1.8610$
$L/H \approx 2/1.861 \approx 1.0747$
(b) $H \approx 2.5037$, $L = 2.7$, so $L/H \approx 1.0784$.
2. Compute L = 2.6.
(a) $\bar{\ell} = 1.9$, so $L/\bar{\ell} \approx 1.3684$.


(b) It is a struggle computing $\bar{\ell}(S^2)$, but here is a tip: it is not necessary to write
out the full encoding scheme; the code word lengths can be found by counting
edges along the paths from the terminal (leaf) nodes of the Huffman tree to the
root node. $\bar{\ell}(S^2) = 3.73$, so the compression ratio is $2L/\bar{\ell}(S^2) = 5.2/3.73 \approx
1.3941$, assuming those “digram” frequencies are correct.
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Another labor-saving remark: if the average length of a source letter, in its 
original incarnation as a binary word, is L, then the average length of two of 
them together will be 2Z, and this holds with no assumptions on the relation 


between the digram and the single letter frequencies. 


(c) Because the relative frequency of the digram s_i s_j is f_i f_j, by assumption,
for each i and j, we have H(S^2) = 2H(S), so the Shannon bound on the
compression ratio is the same in both cases: 2L/H(S^2) = 2L/2H(S) = L/H = 2.6/H ≈ 1.4081.
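
The identity H(S^2) = 2H(S) for independent digrams is easy to confirm numerically. The single-letter frequencies below are hypothetical, chosen only so that the entropies come out exactly; they are not the frequencies of the exercise.

```python
from itertools import product
from math import log2

def entropy(freqs):
    """Entropy in bits of a list of relative frequencies."""
    return sum(f * log2(1.0 / f) for f in freqs if f > 0)

# Hypothetical zeroth-order source (not the one in the exercise).
single = [0.5, 0.25, 0.125, 0.125]

# Independence assumption: the digram s_i s_j has relative frequency f_i * f_j.
digram = [fi * fj for fi, fj in product(single, repeat=2)]

H1 = entropy(single)   # 1.75 bits per letter
H2 = entropy(digram)   # 3.5 bits per digram
print(H1, H2)
```

Since H2 = 2 * H1, the bound 2L/H(S^2) reduces to L/H(S) for any original word length L.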


3. H = L so the Shannon bound is L/L = 1. Shannon’s method will give an
encoding scheme with every word of length L. (In fact, if S = {0, 1}^L is ordered
correctly, the scheme will be w → w for all w ∈ {0, 1}^L.) Thus the compression
ratio achieved by Shannon’s method is 1. Huffman’s algorithm cannot do worse
than Shannon’s method, nor better than the Shannon bound, so the compression
ratio will again be 1 (and, in fact, all code words will be of length L). The same
holds for Fano’s method; an easy induction on L shows that the code words in
the resulting scheme will all have length L.


6.1 Pure zeroth-order arithmetic coding: dfwld 


1. (a) bbbb → 1, abcd → 00110111, dcba → 111110011, badd → 0111010001
(b) 11 → cbab, 010001 → acba, 10101 → caaa, 0101 → acdc


2. (a) bbbb is encoded 1000. The other three words are encoded as in Exercise 
6.1.1(a). 


(b) baaca 


3. acdcaca 


6.4 Implementing arithmetic coding 


1. (a)
Next letter, rescale, or underflow | L | H | New code | Underflow count
The code: 11100110111. Notice that in this problem we have violated the policy
guideline that EOF should be last in the ordering of the source letters. 

(c) In [0, C), the subintervals [0, 3), [3, 4), [4, 5), [5, 6), and [6, 7) correspond to
a, b, c, EOF, and d, resp. As in the text, we scale back to these subintervals to
determine the current symbol (and then (6.1) is used to obtain the new current
interval).
value v         current interval   calculation               output symbol
1110_2 = 14     [0, 16)            w = 6                     d
                [13, 16)           expand x ↦ 2(x − M/2)
1100_2 = 12     [10, 16)           expand x ↦ 2(x − M/2)
1011_2 = 11     [0, 16)            w = 5                     EOF


Of course, the decoding could also be done without the calculation of w. As 
in the decoding examples in the text, that would involve the calculation of the 
subintervals of [L, H) corresponding to a, b, c, EOF, and d each time decoding 
is about to take place. 
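
Once a scaled value w in [0, C) is known, finding the symbol is a lookup in the list of subintervals from part (c). The sketch below covers only that lookup step; it does not compute w from the current interval, and the names are illustrative.

```python
from bisect import bisect_right

# Subintervals of [0, 7) from part (c): [0,3) a, [3,4) b, [4,5) c, [5,6) EOF, [6,7) d.
SYMBOLS = ["a", "b", "c", "EOF", "d"]
RIGHT_ENDPOINTS = [3, 4, 5, 6, 7]

def symbol_for(w: int) -> str:
    """Return the symbol whose half-open subinterval contains w."""
    if not 0 <= w < RIGHT_ENDPOINTS[-1]:
        raise ValueError("w must lie in [0, 7)")
    # bisect_right counts the right endpoints that are <= w,
    # which is the index of the subinterval containing w.
    return SYMBOLS[bisect_right(RIGHT_ENDPOINTS, w)]

print(symbol_for(6), symbol_for(5))
```

For instance, w = 6 lies in [6, 7) and selects d, while w = 5 lies in [5, 6) and selects EOF.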


3. Here, 16 = M = 2^m, so condition (6.2) requires log_2 5 = log_2 |S| < c and c <
m − 2 = 2, which is not possible. Although the condition in 6.4.5 is not satisfied,
direct calculation in the critical case when [L, H) = [3, 9) or [7, 13), i.e., when
H − L = 6, shows that the last statement in the exercise holds.


4. In the expansion tagged ‘x ↦ 2x’, the values L and H − 1 should be replaced by
2L and 2H − 1, respectively. Shifting left gives the same result as multiplying
by 2, and the bitwise ‘| 1’ adds 1 to an even number. Hence, L << 1 = 2L and
((H − 1) << 1) | 1 = (H − 1)·2 + 1 = 2H − 1, as desired.
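
The two identities can be checked mechanically. The sketch below assumes the interval [L, H) is stored as the pair (L, H − 1), as the answer describes; the function name is illustrative and is not taken from the text's implementation.

```python
def expand_2x(low: int, high_inclusive: int):
    """Apply the 'x -> 2x' expansion to an interval stored as (L, H - 1)."""
    new_low = low << 1                              # L << 1 = 2L
    new_high_inclusive = (high_inclusive << 1) | 1  # ((H-1) << 1) | 1 = 2H - 1
    return new_low, new_high_inclusive

# Exhaustive check on small intervals [L, H) with 0 <= L < H <= 8.
for L in range(0, 8):
    for H in range(L + 1, 9):
        assert expand_2x(L, H - 1) == (2 * L, 2 * H - 1)
```

The ‘| 1’ is safe because (H − 1) << 1 is always even, so setting the low bit is the same as adding 1.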


6. (a) The corresponding lines in the table are 


string s    input   P(1|s)   C(s)            A(s)    A(s0)   A(s1)
0100010     1       1/2^2    01110.1100      .1100   .1001   .0011
01000101                     01111.0101      .0011                    shift 2
01000101                     0111101.0100    .1100


(c) The handling of the stuffed bit by the decoder leads to a starting codeword 
of “1000”. 
(d) The increased code length can be estimated. If a limit of k consecutive 1s is
imposed, then a stuffed bit will be inserted every 2^k output bits on average.


7.1 Higher-order Huffman encoding 


1. (a) f_1 = .35, f_2 = .32, f_3 = .23, f_4 = .10; L = 2.57 and the unique code word
lengths in the scheme from Huffman’s algorithm are 1, 2, 3, 3, so ℓ̄ = 1.98; the
compression ratio is L/ℓ̄ = 2.57/1.98 ≈ 1.298.
(b) ℓ̄(S^2) = 3.64; the compression ratio is 2(2.57)/3.64 ≈ 1.4121. Review the
remarks in the answer to Exercise 5.4.2(b).
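
The code word lengths in part (a) can be reproduced without writing out the scheme, by counting how many merges each leaf takes part in. This is a minimal sketch of Huffman's merging step, with illustrative names; it tracks lengths only, not the code words themselves.

```python
import heapq
from itertools import count

def huffman_lengths(freqs):
    """Code word lengths from Huffman's algorithm, found by counting merges."""
    tie = count()  # tie-breaker so the heap never has to compare the leaf lists
    heap = [(f, next(tie), [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    while len(heap) > 1:
        f1, _, leaves1 = heapq.heappop(heap)
        f2, _, leaves2 = heapq.heappop(heap)
        # Every leaf under the merged node moves one edge further from the root.
        for i in leaves1 + leaves2:
            lengths[i] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), leaves1 + leaves2))
    return lengths

freqs = [0.35, 0.32, 0.23, 0.10]   # f_1, ..., f_4 from part (a)
lengths = huffman_lengths(freqs)
avg_len = sum(f * n for f, n in zip(freqs, lengths))
ratio = 2.57 / avg_len             # L = 2.57 is taken from the answer
print(lengths, round(avg_len, 2), round(ratio, 3))
```

The same function applied to the sixteen digram frequencies would give the lengths needed for part (b).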


(c) The following schemes are not unique, of course, but the code word lengths 
are, except in context s3. 


Starter scheme:  s_1 → 0      Context s_1:  s_1 → 0
                 s_2 → 10                   s_2 → 111
                 s_3 → 110                  s_3 → 10
                 s_4 → 111                  s_4 → 110
Context s_2:     s_1 → 110    Context s_3:  s_1 → 10       00
                 s_2 → 0                    s_2 → 0    or  01
                 s_3 → 10                   s_3 → 110      10
                 s_4 → 111                  s_4 → 111      11
Context s_4:     s_1 → 00
                 s_2 → 01
                 s_3 → 10
                 s_4 → 11
Encoding:  s_2  s_2  s_1  s_3  s_1  s_1  s_1  s_3  s_2  s_3  s_3  s_1  s_4
        →  10   0    110  10   10   0    0    10   0    10   110  10   110
(d) [ℓ_ij] =
    1 3 2 3
    3 1 2 3
    2 1 3 3
    2 2 2 2
and ℓ̄ = Σ_i Σ_j f_ij ℓ_ij = 1.79. Compression ratio = 2.57/1.79 ≈ 1.4358.


2. (a) Letting ℓ̄(S^2) denote the average length of a code word replacing a digram,
using the erroneous digram frequencies to produce a scheme via Huffman’s
algorithm and to calculate the average, you obtain ℓ̄(S^2) = 3.7741, for a supposed
compression ratio of 2(2.57)/3.7741 ≈ 1.3619.


(b) Using the code word lengths from the scheme alluded to in (a), but using the
true digram frequencies to compute the average code word length, you obtain
ℓ̄(S^2) = 3.77, for a compression ratio of 5.14/3.77 ≈ 1.3634. Notice that the
encoder using the erroneous digram frequencies has done better than he thinks,
but not as well as the encoder in Exercise 7.1.1(b), using the true digram
frequencies.


3. In both (a) and (b) the average code word length is 1.98, the same as for zeroth- 
order Huffman encoding, and so the compression ratio is the same as in Exercise 
7.1.1(a). 


This is no accident. A little reflection reveals a moral here, that there is no point 
in attempting higher order encoding with a zeroth-order source. 


4, 0 = 1,2, @ = 1.18, &(S2) = 1.83. 
7.2 The Shannon bound for higher-order encoding


1. H^(0)(S) = H(S) =
.35 log_2(.35)^{-1} + .32 log_2(.32)^{-1} + .23 log_2(.23)^{-1} + .1 log_2 10 ≈ 1.8760,
H(S^2) = −Σ_i Σ_j f_ij log_2 f_ij ≈ 3.603, H^(1)(S) = H(S^2) − H(S) ≈ 1.7271,
L/H^(0) ≈ 2.57/1.876 ≈ 1.3699, L/H^(1) ≈ 2.57/1.7271 ≈ 1.4880.


2. H^(0) ≈ .9219, H(S^2) ≈ 1.7012, H^(1)(S) = H(S^2) − H^(0) ≈ .7793.


7.3 Higher-order arithmetic coding 


1. (a) s_2 s_2 s_2 s_2 → 1001, s_1 s_2 s_3 s_4 → 010000011,
s_4 s_3 s_2 s_1 → 1111100111, s_2 s_1 s_4 s_4 → 0111101011
(b) 11 → s_3 s_1 s_1 s_3, 01001 → s_1 s_3 s_1 s_1, 10101 → s_2 s_3 s_1 s_1, 0101 → s_1 s_3 s_1 s_3


8.1 Adaptive Huffman encoding 


. 1000011011101001011010011110101001100111001001011 
. 100110110111110010011110011001111011111001011010 


. s_2 s_1 s_2 s_5 s_3 s_5 s_1 s_1 s_1


. s_2 s_2 s_2 s_2 s_3 s_1 s_2 s_6 s_2 s_2


8.3 Adaptive arithmetic coding 


1. Note: this exercise was not done using the algorithm of Section 6.4, but rather 
the ivory-tower, “pure” dfwld method of Section 6.1. If you did it via Section 
6.4, your answers will be different. 


(a) bbbb → 01011, abcd → 001001, dcba → 110111, badd → 010011001
(b) 11 → daaa, 010001 → baad, 10101 → cccc, 0101 → bbac


8.4 Interval and recency rank encoding 


1. (a) 001001001110010000010101001000001010001000001001 100100001 1000101000100111 
(b) 11100111101101111100110110111101110111100110111011101111000 

2. (a) s_2 s_6 s_2 s_2 s_5 s_4 s_5 s_5
(b) s_3 s_6 s_4 s_4 s_2 s_6 s_2 s_3 s_3 s_3 s_5


9.1 LZ77 (sliding window) schemes 


1. Only three pairs are produced, due to the minimum on match lengths. 


2. Lazy evaluation will send ‘a’ as a literal and (offset, length)-pairs for ‘bcde’ 
and ‘fq’. 
3. Both conventions favor small match distances, hopefully improving compression
obtained by the Huffman back-end. Note that ignoring length-3 matches
that are too distant expands the output at the dictionary stage of the scheme
by 3 bits in the case of no matches starting at any of the 3 characters.


5. The material surrounding the discussion of Fibonacci hashing (with w = 2^16
and M = 2^12) in [39] applies. However, the constant A = 40543 does not
satisfy all of the recommendations, and other choices are possible. In experiments
performed by Williams during the development of LZRW1, the choice did as
well as 40507 (which corresponds to a value closer to the “golden ratio
recommendation”) and better than 40637. The tests were not exhaustive, and Williams
cautions that there may exist formulae of this basic type that do better.¹
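
To see how the three constants compare with the golden ratio recommendation, one can compute A/w for each. The sketch below takes w = 2^16 and M = 2^12 as the values meant in the answer, and uses the generic multiplicative hashing form; it is not Williams's actual LZRW1 hash function, and the function names are illustrative.

```python
from math import sqrt

W = 2 ** 16                      # assumed word size w
M = 2 ** 12                      # assumed hash table size M
GOLDEN = (sqrt(5) - 1) / 2       # about 0.6180339887

def multiplicative_hash(key: int, a: int = 40543) -> int:
    """Generic multiplicative hash: floor(M * ((a * key) mod w) / w)."""
    return ((a * key) % W) * M // W

def distance_from_golden(a: int) -> float:
    """How far the multiplier ratio a / w is from the golden ratio."""
    return abs(a / W - GOLDEN)

for a in (40507, 40543, 40637):
    print(a, round(a / W, 6), round(distance_from_golden(a), 6))
```

Of the three, 40507 gives the ratio closest to the golden ratio and 40637 the farthest, which matches the ordering described in the answer. How well each constant spreads real keys is an empirical question that this comparison does not settle.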


9.2 The LZ78 approach 


1. The dictionary is 


[Table: dictionary entries and phrases]


2. #1, #2, #4, #4, #6, #8, #3, #3, #2. The final trie is 


3. This is an example of the “exceptional case” in LZW; the string ‘abbb’ is ob- 
tained. 


4. An ‘a’ following ‘b’ is allocated 1/3 of the code space or −log_2 1/3 ≈ 1.58
bits.


10.2 Periodic signals and the Fourier transform 


5. (a) (z_k)^N = (e^{2πik/N})^N = e^{2kπi} = 1.


¹ Notes written during the development of LZRW1 on the experimental results were provided by
Williams.
(b) [Figure with panels N = 2, N = 3, N = 4, N = 8]


10. Take a = (u, w)/|(u, w)|. Hint: ‖u − v‖^2 = 2.


10.3 The cosine and sine transforms 


1. Cy = [4.74, 12.01, 31.98, —5.27, 1.77, —2.30, 0.15, —1.67] 
Sy = [—5.87, 13.24, 29.97, —0.62, 10.09, —1.39, 3.72, —1.61] 


5. |[I|? = 0,0) = (Ev, Ev) = (v, E Ev) = (v,v) = |lvll? 


10.5 An application: JPEG image compression 


640 0 0 0 213 0 0 0 
|) O:0°0-0 _| 0000 
1. @Tx=!) 9 9 0 0| 27*=| vo000 
0000 0000 
645.00 —3.35 0.50 —0.24 215 -1 0 0 
_ | -335 1.21 0.46 0.00 -1 000 
(b+) Tx=| O50 «046 0.00 -0.19| 27*=] 0 000 
~0.24 0.00 —0.19 —0.21 0 000 
280.00 230.70 40.00 34.33 93 46 6 4 
_ | 126.17 —11.72 126.17 28.28 _|-25 2 14 3 
(©) TX=| 4900 21.65 —40.00 5226] 27*=| 6 2-4 4 
8.97 28.28 8.97 —68.28 a 
279 230 42 36 i6ts 4’ 4:9 
be ea i959 ty 106), SH es = (160s aie 9 Be eh 
Dequantize: Tx=| 4) _ig 44 52|*=| 159 102 -5 0 
9 33. 13-95 161 161 159 —1 
379.75 -118.54 23.25 82.93 127 -2 3 9 
_ | 127.60 —22.91 —22.43 85.83 il 26 eRe 
() TxX=1] 15995 50.24 77.75 a36| 27*=|_ig 6-7 0 
55.83 4.33 61.34 —125.09 6 0-5 -8 


10.6 A brief introduction to wavelets 


1. A threshold value of 10 was chosen, giving the “clipped” matrices below. 


640 0 0 0 640 0 0 0 
0000 


(a) Hx=| 9 9 9 0 


‘ 0000 
clipped Hx = 6 0-640 
0000 
160 160 160 160 
160 160 160 160 


160 160 160 160 
160 160 160 160 


2 
ll 
645.00 —3.00 —0.71 —1.41 645 0 0 0 
_ | -3.00 1.00 0.71 0 _| 0000 
(b) Hx=) 97) o71 0 o |clippedHx=1 9 9 9 0 
-141 0 0 0 0000 
161 161 161 161 
x — | 161 161 161 161 
=] 161 161 161 161 
161 161 161 161 
280.00 200.00 113.14 56.57 
_ | -120.00 —40.00 113.14 —56.57 
(c) Hx= 0 0 0 0 
—56.57 56.57 0  —80.00 
280.00 200.00 113.14 56.57 160 0 00 
_ | -120.00 —40.00 113.14 —5657| ~_ | 160 0 00 
clipped Hx = 0 0 0 0 |*=]160 160 00 
56.57 56.57 0 80.00 160 160 160 0 
379.75 141.25 38.54 5.66 
_ | 139.25 69.75 78.84 74.95 
(d) Hx=| 9334 50.91 —99.00 44.00 
89.45 —12.37 —44.50 —57.00 
379.75 —141.25 38.54 0 54 70 180 83 
_ | 139.25 69.75 78.84 74.95] ~_ | 183 1 238 229 
clipped Hx =| 9334 50.91 -99.00 4400|*=| 33 106 59 169 
89.45 —12.37 —44.50 —57.00 23 7 44 40 


2. c_0 = c_1 = 1, c_i = 0 otherwise.